---
layout: single_page
---

{% raw %}
By Sam Lau, Joey Gonzalez, and Deb Nolan,
edited and translated by Olaf Bochmann.
This is the textbook for Data 100, the Principles and Techniques of Data Science course at UC Berkeley.
Data 100 is the upper-division, semester-long data science course that follows Data 8, the Foundations of Data Science. The reader's assumed background is detailed in the About This Book page.
The contents of this book are licensed for free consumption under the following license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
To set up the textbook for local development, see the setup guide.
In this book, we will proceed as though the reader is comfortable with the knowledge presented in Data 8 or some equivalent. In particular, we will assume that the reader is familiar with the following topics (links to pages from the Data 8 textbook are given in parentheses).
In addition, we assume that the reader has taken a course in computer programming in Python, such as CS61A or some equivalent. We will not explain Python syntax except in special cases.
Finally, we assume that the reader has basic familiarity with partial derivatives, gradients, vector algebra, and matrix algebra.
This book covers topics from multiple disciplines. Unfortunately, some of these disciplines use the same notation to describe different concepts. In order to prevent headaches, we have devised notation that may differ slightly from the notation used in your discipline.
A population parameter is denoted by $\theta^*$. The model parameter that minimizes a specified loss function is denoted by $\hat{\theta}$. Typically, we desire $\hat{\theta} \approx \theta^*$. We use the plain variable $\theta$ to denote a model parameter that does not minimize a particular loss function. For example, we may arbitrarily set $\theta$ to a fixed value in order to calculate a model's loss at that choice of $\theta$. When using gradient descent to minimize a loss function, we use $\theta^{(t)}$ to represent the intermediate values of $\theta$.
We will always use bold lowercase letters for vectors. For example, we represent a vector of population parameters using $\boldsymbol{\theta}^* = [\theta^*_1, \theta^*_2, \ldots, \theta^*_n]$ and a vector of fitted parameters using $\boldsymbol{\hat{\theta}} = [\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n]$.
We will always use bold uppercase letters for matrices. For example, we commonly represent a data matrix using $\boldsymbol{X}$.
We will always use non-bolded uppercase letters for random variables, such as $X$ or $Y$.
When discussing the bootstrap, we use $\theta^*$ to denote the population parameter, $\hat{\theta}$ to denote the sample test statistic, and $\tilde{\theta}$ to denote a bootstrapped test statistic.
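To make this notation concrete, the sketch below (our own illustration, not part of the original text) computes each of these quantities for a simulated population; the exponential population and the choice of the mean as the test statistic are arbitrary assumptions for the example:

```python
import numpy as np

np.random.seed(42)

# Simulated population; its mean plays the role of the population parameter.
population = np.random.exponential(scale=2.0, size=100_000)
theta_star = population.mean()            # population parameter

# One sample from the population gives the sample test statistic.
sample = np.random.choice(population, size=500, replace=False)
theta_hat = sample.mean()                 # sample test statistic

# Resampling the sample with replacement gives bootstrapped test statistics.
boot_thetas = np.array([
    np.random.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])
```

The spread of the bootstrapped statistics around the sample statistic is what we later use to estimate the variability of the sample statistic itself.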
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/01'))
In data science, we use large and diverse data sets to make conclusions about the world. In this book we discuss principles and techniques of data science through the dual lens of computational and inferential thinking. Practically speaking, this involves the following process:
It is quite common for more questions and problems to emerge after the last step of this process, and we can thus repeatedly engage in this procedure to discover new characteristics of our world. This positive feedback loop is so central to our work that we call it the data science lifecycle.
If the data science lifecycle were as easy to conduct as it is to state, there would be no need for textbooks about the subject. Fortunately, each step in the lifecycle contains numerous challenges that reveal powerful and often surprising insights that form the foundation of making thoughtful decisions using data.
As in Data 8, we will begin with an example.
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
The data science lifecycle involves the following general steps:
We now demonstrate this process applied to a dataset of student first names from a previous offering of Data 100. In this chapter, we proceed quickly in order to give the reader a general sense of a complete iteration through the lifecycle. In later chapters, we expand on each step in this process to develop a repertoire of skills and principles.
We would like to figure out if the student first names give us additional information about the students themselves. Although this is a vague question to ask, it is enough to get us working with our data and we can make the question more precise as we go.
Let's begin by looking at our data, the roster of student first names that we've downloaded from a previous offering of Data 100.
Don't worry if you don't understand the code for now; we introduce the libraries in more depth soon. Instead, focus on the process and the charts that we create.
import pandas as pd
students = pd.read_csv('roster.csv')
students
We can quickly see that there are some quirks in the data. For example, one student's name is in all uppercase letters. In addition, it is not obvious what the Role column is for.
In Data 100, we will study how to identify anomalies in data and apply corrections. The differences in capitalization will cause our programs to think that 'BRYAN' and 'Bryan' are different names when they are identical for our purposes. Let's convert all names to lower case to avoid this.
students['Name'] = students['Name'].str.lower()
students
Now that our data are in a more useful format, we proceed to exploratory data analysis.
The term Exploratory Data Analysis (EDA for short) refers to the process of discovering traits about our data that inform future analysis.
Here's the students table from the previous page:
students
We are left with a number of questions. How many students are in this roster? What does the Role column mean? We conduct EDA in order to understand our data more thoroughly.
Oftentimes, we explore the data by repeatedly posing questions as we uncover more information.
How many students are in our dataset?
print("There are", len(students), "students on the roster.")
A natural follow-up question: does this dataset contain the complete list of students? In this case, this table contains all students in one semester's offering of Data 100.
What is the meaning of the Role field?
We often examine the field's data in order to understand the field itself.
students['Role'].value_counts().to_frame()
We can see here that our data contain not only students enrolled in the class at the time but also the students on the waitlist. The Role column tells us whether each student is enrolled.
What about the names? How can we summarize this field?
In Data 100 we will work with many different kinds of data, including numerical, categorical, and text data. Each type of data has its own set of tools and techniques.
A quick way to start understanding the names is to examine the lengths of the names.
sns.distplot(students['Name'].str.len(),
             rug=True,
             bins=np.arange(12),
             axlabel="Number of Characters")
plt.xlim(0, 12)
plt.xticks(np.arange(12))
plt.ylabel('Proportion per character');
This visualization shows us that most names are between 3 and 9 characters long. This gives us a chance to check whether our data seem reasonable — if there were many names that were 1 character long we'd have good reason to re-examine our data.
Although this dataset is rather simple, we will soon see that first names alone can reveal quite a bit about our group of students.
So far, we have asked a broad question about our data: "Do the first names of students in Data 100 tell us anything about the class?"
We have cleaned our data by converting all our names to lowercase. During our exploratory data analysis we discovered that our roster contains about 270 names of students in the class and on the waitlist. Most of our first names are between 4 and 8 characters long.
What else can we discover about our class based on their first names? We might consider a single name from our dataset:
students['Name'][5]
From this name we can infer that the student is likely a male. We can also take a guess at the student's age. For example, if we happen to know that Jerry was a very popular baby name in 1998, we might guess that this student is around twenty years old.
This thinking gives us two new questions to investigate:
In order to investigate these questions, we will need a dataset that associates names with sex and year. Conveniently, the US Social Security Administration hosts such a dataset online (https://www.ssa.gov/oact/babynames/index.html). Their dataset records the names given to babies at birth and is thus often referred to as the Baby Names dataset.
We will start by downloading and then loading the dataset into Python. Again, don't worry about understanding the code in this chapter—focus instead on understanding the overall process.
import urllib.request
import os.path
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())
import zipfile
babynames = []
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]

    def extract_year_from_filename(fn):
        return int(fn[3:7])

    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
        df["Year"] = year
        babynames.append(df)

babynames = pd.concat(babynames)
babynames
It looks like the dataset contains names, the sex assigned to each baby, the number of babies with that name, and the year of birth for those babies. To be sure, we check the dataset description from the Social Security Administration (https://www.ssa.gov/oact/babynames/background.html).
All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
All data are from a 100% sample of our records on Social Security card applications as of March 2017.
We begin by plotting the number of male and female babies born each year:
pivot_year_name_count = pd.pivot_table(
    babynames, index='Year', columns='Sex',
    values='Count', aggfunc=np.sum)
pink_blue = ["#E188DB", "#334FFF"]
with sns.color_palette(pink_blue):
    pivot_year_name_count.plot(marker=".")
    plt.title("Registered Names vs Year Stratified by Sex")
    plt.ylabel('Names Registered that Year')
The meteoric rise in babies born in the years leading up to 1920 may seem suspicious. A sentence from the quote above helps explain:
Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
We can also see the baby boomer period quite clearly in the plot above.
Let's use this dataset to estimate the number of females and males in our class. As with our class roster, we begin by lowercasing the names:
babynames['Name'] = babynames['Name'].str.lower()
babynames
Then, we count up how many male and female babies were born in total for each name:
sex_counts = pd.pivot_table(babynames, index='Name', columns='Sex',
                            values='Count', aggfunc='sum',
                            fill_value=0., margins=True)
sex_counts
To determine whether a name is more popular for male or female babies, we can compute the proportion of times the name was given to a female baby.
prop_female = sex_counts['F'] / sex_counts['All']
sex_counts['prop_female'] = prop_female
sex_counts
We can then define a function that looks up the proportion of female names given a name.
def sex_from_name(name):
    if name in sex_counts.index:
        prop = sex_counts.loc[name, 'prop_female']
        return 'F' if prop > 0.5 else 'M'
    else:
        return 'Name not in dataset'
sex_from_name('sam')
In this book, we include widgets that allow the reader to interact with functions defined in the book. The widget below displays the output of sex_from_name on a reader-provided name.
Try typing in the name "josephine" and see how the inferred sex changes as more characters are entered.
interact(sex_from_name, name='sam');
We mark each name in our class roster with its most likely sex.
students['sex'] = students['Name'].apply(sex_from_name)
students
Now it is easy to estimate how many male and female students we have:
students['sex'].value_counts()
We can proceed in a similar way to estimate the age distribution of the class, mapping each name to its average age in the dataset.
def avg_year(group):
    return np.average(group['Year'], weights=group['Count'])

avg_years = (
    babynames
    .groupby('Name')
    .apply(avg_year)
    .rename('avg_year')
    .to_frame()
)
avg_years
As before, we define a function to look up the average birth year for a given name. We've included a widget for the reader to try out some names. We suggest trying names that seem older (e.g. "Mary") and names that seem newer (e.g. "Beyonce").
def year_from_name(name):
    return (avg_years.loc[name, 'avg_year']
            if name in avg_years.index
            else None)
# Generate input box for you to try some names out:
interact(year_from_name, name='fernando');
Now, we can mark each name in Data 100 with its inferred birth year.
students['year'] = students['Name'].apply(year_from_name)
students
Then, it is easy to plot the distribution of years:
sns.distplot(students['year'].dropna());
To compute the average year:
students['year'].mean()
Our class has an average age of 35 years old—nearly twice our expected age in a course for college undergraduates. Why might our estimate be so far off?
As data scientists, we often run into results that don't agree with our expectations. Our constant challenge is to determine whether surprising results are caused by an error in our procedure or by an actual, real-world phenomenon. Since there are no simple recipes to guarantee accurate conclusions, data scientists must equip themselves with guidelines and principles to reduce the likelihood of false discovery.
In this particular case, the most likely explanation for our unexpected result is that most common names have been used for many years. For example, the name John was quite popular throughout the history recorded in our data. We can confirm this by plotting the number of babies given the name "John" each year:
names = babynames.set_index('Name').sort_values('Year')
john = names.loc['john']
john[john['Sex'] == 'M'].plot('Year', 'Count')
plt.title('Frequency of "John"');
It appears that the average birth year does not provide an accurate estimate for a given person's age in general. In a few cases, however, a person's first name is quite revealing!
names = babynames.set_index('Name').sort_values('Year')
kanye = names.loc['kanye']
kanye[kanye['Sex'] == 'M'].plot('Year', 'Count')
plt.title('Frequency of "Kanye"');
In this chapter, we walk through a complete iteration of the data science lifecycle: question formulation, data manipulation, exploratory data analysis, and prediction. We expand upon each of these steps in the following chapters.
The first half of the book (chapters 1-9) broadly covers the first three steps in the lifecycle and has a strong focus on computation. The second half of the book (chapters 10-18) uses both computational and statistical thinking to cover modeling, inference, and prediction.
As a whole, this book aims to impart to the reader the principles and techniques of data science.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/02'))
Data science would hardly be a discipline without data. It is thus of utmost importance that we begin any data analysis by understanding how our data were collected.
In this chapter we discuss data design, the process of data collection. Many well-meaning scientists have drawn premature conclusions because they were not careful enough in understanding their data design. We will use examples and simulations to justify the importance of probability sampling in data science.
In the 1948 US Presidential election, New York Governor Thomas Dewey ran against the incumbent Harry Truman. As usual, a number of polling agencies conducted polls of voters in order to predict which candidate was more likely to win the election.
In 1936, three elections prior to 1948, the Literary Digest infamously predicted a landslide defeat for Franklin Delano Roosevelt. To make this claim, the magazine polled a sample of over 2 million people based on telephone and car registrations. As you may know, this sampling scheme suffers from sampling bias: those with telephones and cars tend to be wealthier than those without. In this case, the sampling bias was so great that the Literary Digest thought Roosevelt would only receive 43% of the popular vote when he ended up with 61% of the popular vote, a difference of almost 20% and the largest error ever made by a major poll. The Literary Digest went out of business soon after.
Determined to learn from past mistakes, the Gallup Poll used a method called quota sampling to predict the results of the 1948 election. In their sampling scheme, each interviewer polled a set number of people from each demographic class. For example, the interviewers were required to interview both males and females of different ages, ethnicities, and income levels to match the demographics in the US Census. This ensured that the poll would not leave out important subgroups of the voting population.
Using this method, the Gallup Poll predicted that Thomas Dewey would earn 5% more of the popular vote than Harry Truman would. This difference was significant enough that the Chicago Tribune famously printed the headline "Dewey Defeats Truman":

As we know now, Truman ended up winning the election. In fact, he won with 5% more of the popular vote than Dewey! What went wrong with the Gallup Poll?
Although quota sampling did help pollsters reduce sampling bias, it introduced bias in another way. The Gallup Poll told its interviewers that as long as they fulfilled their quotas they could interview whomever they wished. Here's one possible explanation for why the interviewers ended up polling a disproportionate number of Republicans: at the time, Republicans were on average wealthier and more likely to live in nicer neighborhoods, making them easier to interview. This observation is supported by the fact that the Gallup Poll predicted 2-6% more Republican votes than the actual results for the 3 elections prior.
These examples highlight the importance of understanding sampling bias as much as possible during the data collection process. Both the Literary Digest and the Gallup Poll made the mistake of assuming their methods were unbiased when their sampling schemes were based on human judgement all along.
We now rely on probability sampling, a family of sampling methods that assigns precise probabilities to the appearance of each sample, to reduce bias as much as possible in our data collection process.
In the age of Big Data, we are tempted to deal with bias by collecting more data. After all, we know that a census will give us perfect estimates; shouldn't a very large sample give almost perfect estimates regardless of the sampling technique?
We will return to this question after discussing probability sampling methods to compare the two approaches.
Many fundamental aspects of data science, including data design, rely on uncertain phenomena. The laws of probability allow us to quantify this uncertainty. In this section, we provide a quick review of important probability concepts for this course.
Suppose we toss two coins with one side labeled heads ($H$) and the other labeled tails ($T$). We call the action of tossing the coins and observing the results our experiment. The outcome space consists of all the possible outcomes of an experiment. In this experiment, our outcome space consists of the following outcomes: $HH$, $HT$, $TH$, $TT$.
An event is any subset of the outcome space. For example, "getting one tails and one heads" is an event for the coin toss experiment. This event consists of the outcomes $HT$ and $TH$. We use the notation $P(\text{Event})$ to represent the probability of an event occurring, in this case, $P(\text{one } T \text{ and one } H)$.
In this course, we will typically deal with outcome spaces where each outcome is equally likely to occur. These spaces have simple probability calculations. The probability of an event occurring is equal to the proportion of outcomes in the event, or the number of outcomes in the event divided by the total number of outcomes in the outcome space:

$$P(\text{Event}) = \frac{\text{number of outcomes in Event}}{\text{number of outcomes in Outcome Space}}$$
In the coin toss experiment, the outcomes are all equally likely to occur. To calculate $P(\text{one } T \text{ and one } H)$, we see that there are two outcomes in the event and four outcomes total:

$$P(\text{one } T \text{ and one } H) = \frac{2}{4} = \frac{1}{2}$$
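We can check this calculation by enumerating the outcome space in code (a small sketch of our own, not part of the original analysis):

```python
from itertools import product

# All outcomes of tossing two coins; each outcome is equally likely.
outcomes = list(product('HT', repeat=2))      # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

# The event "one tails and one heads".
event = [o for o in outcomes if set(o) == {'H', 'T'}]

p = len(event) / len(outcomes)
print(len(outcomes), len(event), p)           # 4 outcomes, 2 in the event, P = 0.5
```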
There are three fundamental axioms of probability:

1. The probability of any event is between 0 and 1: $0 \leq P(\text{Event}) \leq 1$.
2. The probability that some outcome in the outcome space occurs is 1: $P(\text{Outcome Space}) = 1$.
3. If two events are mutually exclusive, the probability that one or the other occurs is the sum of their individual probabilities.
The third axiom involves mutually exclusive events.
We often want to calculate the probability that event $A$ or event $B$ occurs. For example, a polling agency might want to know the probability that a randomly selected US citizen is either 17 or 18 years old.
We state that events are mutually exclusive when at most one of them can happen. If $A$ is the event that we select a 17-year-old and $B$ is the event that we select an 18-year-old, $A$ and $B$ are mutually exclusive because a person cannot be both 17 and 18 at the same time.
The probability that $A$ or $B$ occurs is simple to calculate when $A$ and $B$ are mutually exclusive:

$$P(A \text{ or } B) = P(A) + P(B)$$
This is the third axiom of probability, the addition rule.
In the 2010 census, 1.4% of US citizens were 17 years old and 1.5% were 18. Thus, for a randomly chosen US citizen in 2010, $P(A) = 0.014$ and $P(B) = 0.015$, giving $P(A \text{ or } B) = 0.014 + 0.015 = 0.029$.
To understand this rule, consider the following Venn diagram from the Prob 140 textbook:

When events $A$ and $B$ are mutually exclusive, there are no outcomes that appear in both $A$ and $B$. Thus, if $A$ has 5 possible outcomes and $B$ has 4 possible outcomes, we know that the event $A \text{ or } B$ has 9 possible outcomes.
Unfortunately, when $A$ and $B$ are not mutually exclusive, the simple addition rule does not apply. Suppose we want to calculate the probability that in two coin flips, exactly one flip is heads or exactly one flip is tails.
If $A$ is the event that exactly one heads appears, $P(A) = \frac{2}{4}$ since $A$ consists of the outcomes $HT$ and $TH$ and there are four outcomes total. If $B$ is the event that exactly one tails appears, $P(B) = \frac{2}{4}$ since $B$ consists of the same outcomes $HT$ and $TH$. However, the event $A \text{ or } B$ only contains two outcomes since the outcomes are the same in both $A$ and $B$: $HT$ and $TH$. Blindly applying the addition rule results in an incorrect conclusion. There are outcomes that appear in both $A$ and $B$; adding the number of outcomes in $A$ and $B$ counts them twice. In the image below, we illustrate the fact that the overlapping region of the Venn diagram gets shaded twice, once by $A$ and once by $B$.

To compensate for this overlap, we need to subtract out the probability that both $A$ and $B$ occur. To calculate the probability that either of two non-mutually exclusive events occurs, we use:

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$
Notice that when $A$ and $B$ are mutually exclusive, $P(A \text{ and } B) = 0$ and the equation simplifies to the addition rule above.
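Enumerating the two-coin outcome space verifies the rule directly (again, a check of our own):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product('HT', repeat=2))
A = {o for o in outcomes if o.count('H') == 1}   # exactly one heads
B = {o for o in outcomes if o.count('T') == 1}   # exactly one tails

def prob(event):
    # All outcomes are equally likely, so probability is a simple proportion.
    return Fraction(len(event), len(outcomes))

# Direct count of the union vs. the inclusion-exclusion formula.
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)   # both 1/2, not the 1 that blind addition would give
```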
We also often wish to calculate the probability that both event $A$ and event $B$ occur. Suppose we have a class with only three students, whom we'll label as $X$, $Y$, and $Z$. What is the probability that drawing a sample of size 2 without replacement results in $X$, then $Y$?
One simple way to calculate this probability is to enumerate the entire outcome space:

$$XY, \; XZ, \; YX, \; YZ, \; ZX, \; ZY$$

Since there are six equally likely outcomes and our event consists of only one of them:

$$P(X \text{ first and } Y \text{ second}) = \frac{1}{6}$$
We can also calculate this probability by noticing that this event can be broken down into two events that happen in sequence: the event of drawing $X$ as the first student, then the event of drawing $Y$ after drawing $X$.
The probability of drawing $X$ as the first student is $\frac{1}{3}$ since there are three outcomes in the outcome space ($X$, $Y$, and $Z$) and our event only contains one.
After drawing $X$, the outcome space of the second draw only contains $Y$ and $Z$. Thus, the probability of drawing $Y$ as the second student is $\frac{1}{2}$. This probability is called the conditional probability of drawing $Y$ second given that $X$ was drawn first. We use the notation $P(B \mid A)$ to describe the conditional probability of an event $B$ occurring given that event $A$ occurs.
Now, observe that:

$$P(X \text{ first and } Y \text{ second}) = \frac{1}{6} = \frac{1}{3} \cdot \frac{1}{2} = P(X \text{ first}) \cdot P(Y \text{ second} \mid X \text{ first})$$

This happens to be a general rule of probability, the multiplication rule. For any events $A$ and $B$, the probability that both $A$ and $B$ occur is:

$$P(A \text{ and } B) = P(A) \cdot P(B \mid A)$$
In certain cases, two events may be independent. That is, the probability that $B$ occurs does not change after $A$ occurs. For example, the event that a 6 occurs on a dice roll does not affect the probability that a 5 occurs on the next dice roll. If $A$ and $B$ are independent, $P(B \mid A) = P(B)$. This simplifies our multiplication rule:

$$P(A \text{ and } B) = P(A) \cdot P(B)$$
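The multiplication rule is easy to verify by enumerating ordered draws without replacement (a sketch of our own; the three student labels are arbitrary):

```python
from itertools import permutations
from fractions import Fraction

students = ['X', 'Y', 'Z']

# All ordered samples of size 2 drawn without replacement: 6 equally likely outcomes.
samples = list(permutations(students, 2))

# Direct enumeration of P(X first and Y second).
p_xy = Fraction(sum(1 for s in samples if s == ('X', 'Y')), len(samples))

# Multiplication rule: P(X first) * P(Y second | X first)
p_x_first = Fraction(1, 3)
p_y_given_x = Fraction(1, 2)
print(p_xy, p_x_first * p_y_given_x)   # both 1/6
```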
Although this simplification is extremely convenient for calculation, many real-world events are not independent even if their relationship is not obvious at first glance. For example, the event that a randomly selected US citizen is over 90 years old is not independent of the event that the citizen is male: given that the person is over 90, the person is almost twice as likely to be female as male.
As data scientists, we must examine assumptions of independence very closely! The US housing crash of 2008 might have been avoided if the bankers did not assume that housing markets in different cities moved independently of one another (link to an Economist article).
Unlike convenience sampling, probability sampling allows us to assign a precise probability to the event that we draw a particular sample. We will begin by reviewing simple random samples from Data 8, then introduce two alternative methods of probability sampling: cluster sampling and stratified sampling.
Suppose we have a population of $6$ individuals. We've given each individual a different letter from $A$ through $F$.
To take a simple random sample of size $2$ from this population, we can write each letter on a single index card, place all the cards into a hat, mix the cards well, and draw $2$ cards without looking. That is, a SRS is sampling uniformly at random without replacement.
Here are all possible samples of size 2:

$$AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF$$

There are $15$ possible samples of size $2$ from our population of $6$. Another way to count the number of possible samples is:

$$\binom{6}{2} = \frac{6!}{2! \, 4!} = 15$$

Since in a SRS we sample uniformly at random, each of these $15$ samples is equally likely to be chosen:

$$P(AB) = P(AC) = \cdots = P(EF) = \frac{1}{15}$$
We can also use this chance mechanism to answer other questions about the composition of the sample. For example:

$$P(A \in \text{sample}) = \frac{5}{15} = \frac{1}{3}$$

since $5$ out of the $15$ possible samples listed above contain $A$.

By symmetry, we can say:

$$P(A \in \text{sample}) = P(B \in \text{sample}) = \cdots = P(F \in \text{sample}) = \frac{1}{3}$$

Another way of computing $P(A \in \text{sample})$ is to recognize that for $A$ to be in the sample, we either need to draw it as the first card or as the second card:

$$P(A \in \text{sample}) = P(A \text{ first}) + P(A \text{ second}) = \frac{1}{6} + \frac{5}{6} \cdot \frac{1}{5} = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$
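These SRS probabilities can be confirmed with a short enumeration (our own sketch, not part of the original text):

```python
from itertools import combinations
from fractions import Fraction

population = 'ABCDEF'

# All unordered samples of size 2; under a SRS each is equally likely.
samples = list(combinations(population, 2))
print(len(samples))                            # 15 possible samples

# P(A in sample) is the proportion of samples containing A.
p_a = Fraction(sum(1 for s in samples if 'A' in s), len(samples))
print(p_a)                                     # 1/3
```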
In cluster sampling, we divide the population into clusters. Then, we use SRS to select clusters at random instead of individuals.
As an example, suppose we take our population of $6$ individuals and we pair each of them up: $(A, B), (C, D), (E, F)$ to form $3$ clusters of $2$ individuals. Then, we use SRS to select one cluster to produce a sample of size $2$.
As before, we can compute the probability that $A$ is in our sample:

$$P(A \in \text{sample}) = P(\text{select cluster } (A, B)) = \frac{1}{3}$$

Similarly, the probability that any particular person appears in our sample is $\frac{1}{3}$. Note that this is the same as our SRS. However, we see differences when we look at the samples themselves. For example, in a SRS the chance of getting $AB$ is the same as the chance of getting $AC$: $\frac{1}{15}$. However, with this cluster sampling scheme:

$$P(AB) = \frac{1}{3} \qquad P(AC) = 0$$

since $A$ and $C$ can never appear in the same sample if we only select one cluster.
Cluster sampling is still probability sampling since we can assign a probability to each potential sample. However, the resulting probabilities differ from those of a SRS, depending on how the population is clustered.
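We can tabulate the cluster-sampling probabilities the same way (a sketch of our own under the clustering described above):

```python
from fractions import Fraction

# Three clusters of two individuals each; a SRS picks one whole cluster.
clusters = [('A', 'B'), ('C', 'D'), ('E', 'F')]

def prob_sample(pair):
    """Probability that the sample is exactly this pair of individuals."""
    hits = sum(1 for c in clusters if set(c) == set(pair))
    return Fraction(hits, len(clusters))

def prob_person(person):
    """Probability that a given person appears in the sample."""
    hits = sum(1 for c in clusters if person in c)
    return Fraction(hits, len(clusters))

print(prob_person('A'))         # 1/3, same as under a SRS
print(prob_sample(('A', 'B')))  # 1/3, versus 1/15 under a SRS
print(prob_sample(('A', 'C')))  # 0: A and C are in different clusters
```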
Why use cluster sampling? Cluster sampling is most useful because it makes sample collection easier. For example, it is much easier to poll towns of 100 people each than to poll thousands of people distributed across the entire US. This is the reason why many polling agencies today use forms of cluster sampling to conduct surveys.
The main downside of cluster sampling is that it tends to produce greater variation in estimation. This typically means that we take larger samples when using cluster sampling. Note that the reality is much more complicated than this, but we will leave the details to a future course on sampling techniques.
In stratified sampling, we divide the population into strata, and then produce one simple random sample per stratum. In both cluster sampling and stratified sampling we split the population into groups; in cluster sampling we use a single SRS to select groups whereas in stratified sampling we use multiple SRS's, one for each group.
We can divide our population of 6 individuals into the following strata: $\{A, B\}$ and $\{C, D, E, F\}$.
We use an SRS to select one individual from each stratum to produce a sample of size 2. This gives us the following possible samples: $AC$, $AD$, $AE$, $AF$, $BC$, $BD$, $BE$, $BF$.
Again, we can compute the probability that $A$ is in our sample: $$ P(A \in \text{sample}) = P(A \text{ chosen from first stratum}) = \frac{1}{2} $$
However: $$ P(AB) = 0 $$ since $A$ and $B$ cannot appear in the same sample.
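These stratified-sampling probabilities can also be enumerated directly. The sketch assumes the unequal strata {A, B} and {C, D, E, F} from the example, with one individual drawn uniformly from each:

```python
from fractions import Fraction
from itertools import product

strata = [('A', 'B'), ('C', 'D', 'E', 'F')]

# One individual per stratum; a sample's probability is the product
# of 1 / (stratum size) over the strata
samples = {
    frozenset(s): Fraction(1, len(strata[0])) * Fraction(1, len(strata[1]))
    for s in product(*strata)
}
assert len(samples) == 8
assert sum(samples.values()) == 1

p_a = sum(p for s, p in samples.items() if 'A' in s)
assert p_a == Fraction(1, 2)

# A and B share a stratum, so the sample AB has probability 0
assert samples.get(frozenset('AB'), 0) == 0
```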
Like cluster sampling, stratified sampling is also a probability sampling method that produces different probabilities depending on the stratification of the population. Note that, as in this example, the strata do not have to be the same size. For example, we can stratify the US by occupation, then take samples from each stratum with size proportional to the distribution of occupations in the US — if only 0.01% of people in the US are statisticians, we can ensure that 0.01% of our sample will be composed of statisticians. A simple random sample might miss the poor statisticians altogether!
As you may have figured out, stratified sampling can perhaps be called the proper way to conduct quota sampling. It allows the researcher to ensure that subgroups of the population are well-represented in the sample without using human judgement to select the individuals in the sample. This can often result in less variation in estimation. However, stratified sampling is sometimes more difficult to accomplish because we sometimes don't know how large each stratum is. In the previous example we have the advantage of the US census, but other times we are not so fortunate.
As we have seen in Data 8, probability sampling enables us to quantify our uncertainty about an estimation or prediction. It is only through this precision that we can conduct inference and hypothesis testing. Be wary when anyone gives you p-values or confidence levels without a proper explanation of their sampling techniques.
Now that we understand probability sampling, let us see how the humble SRS compares against "big data".
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/02'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
As we have previously mentioned, it is tempting to do away with our long-winded bias concerns by using huge amounts of data. It is true that collecting a census will by definition produce unbiased estimations. Perhaps bias will no longer be a problem if we simply collect many data points.
Suppose we are pollsters in 2012 trying to predict the popular vote of the US presidential election, where Barack Obama ran against Mitt Romney. Since today we know the exact outcome of the popular vote, we can compare the predictions of a SRS to the predictions of a large non-random dataset. Such datasets are often called administrative datasets since they are typically collected as part of some administrative work.
We will compare a SRS of size 400 to a non-random sample of size 60,000,000. Our non-random sample is nearly 150,000 times larger than our SRS! Since there were about 120,000,000 voters in 2012, we can think of our non-random sample as a survey where half of all voters in the US responded (no actual poll has ever surveyed more than 10,000,000 voters).
# HIDDEN
total = 129085410
obama_true_count = 65915795
romney_true_count = 60933504
obama_true = obama_true_count / total
romney_true = romney_true_count / total
# 1 percent off
obama_big = obama_true - 0.01
romney_big = romney_true + 0.01
Here's a plot comparing the proportions of the non-random sample to the true proportions. The bars labeled truth show the true proportions of votes that each candidate received. The bars labeled big show the proportions from our dataset of 60,000,000 voters.
# HIDDEN
pd.DataFrame({
'truth': [obama_true, romney_true],
'big': [obama_big, romney_big],
}, index=['Obama', 'Romney'], columns=['truth', 'big']).plot.bar()
plt.title('Truth compared to a big non-random dataset')
plt.xlabel('Candidate')
plt.ylabel('Proportion of popular vote')
plt.ylim(0, 0.75);
We can see that our large dataset is just slightly biased towards the Republican candidate, just as the Gallup Poll was in 1948. To see the effects of this bias, we simulate taking simple random samples of size 400 from the population and large non-random samples of size 60,000,000. We compute the proportion of votes for Obama in each sample and plot the distribution of proportions.
srs_size = 400
big_size = 60000000
replications = 10000
def resample(size, prop, replications):
return np.random.binomial(n=size, p=prop, size=replications) / size
srs_simulations = resample(srs_size, obama_true, replications)
big_simulations = resample(big_size, obama_big, replications)
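With a fixed seed (an assumption added here for reproducibility), we can sanity-check the simulation before plotting: the SRS proportions should center on the truth, while the big-sample proportions cluster tightly around the biased value.

```python
import numpy as np

np.random.seed(42)  # fixed seed so the check is reproducible

obama_true = 65915795 / 129085410  # true proportion from above
obama_big = obama_true - 0.01      # the biased big-sample proportion

def resample(size, prop, replications):
    return np.random.binomial(n=size, p=prop, size=replications) / size

srs_sims = resample(400, obama_true, 10000)
big_sims = resample(60000000, obama_big, 10000)

# SRS estimates are spread out but centered on the truth...
assert abs(srs_sims.mean() - obama_true) < 0.005
# ...while big-sample estimates are narrow but centered on the wrong value
assert abs(big_sims.mean() - obama_big) < 0.005
assert big_sims.std() < srs_sims.std()
```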
Now, we plot the simulation results and overlay a red line indicating the true proportion of voters that voted for Obama.
bins = np.arange(0.47, 0.55, 0.005)
plt.hist(srs_simulations, bins=bins, alpha=0.7, density=True, label='srs')
plt.hist(big_simulations, bins=bins, alpha=0.7, density=True, label='big')
plt.title('Proportion of Obama Voters for SRS and Big Data')
plt.xlabel('Proportion')
plt.ylabel('Percent per unit')
plt.xlim(0.47, 0.55)
plt.ylim(0, 50)
plt.axvline(x=obama_true, color='r', label='truth')
plt.legend();
As you can see, the SRS distribution is spread out but centered around the true population proportion of Obama voters. The distribution created by the large non-random sample, on the other hand, is very narrow but not a single simulated sample produces the true population proportion. If we attempt to create confidence intervals using the non-random sample, none of them will contain the true population proportion. To make matters worse, the confidence interval will be extremely narrow because the sample is so large. We will be very sure of an ultimately incorrect estimation.
In fact, when our sampling method is biased our estimations will often become worse as we collect more data, since we will be more certain about an incorrect result. In order to make accurate estimations using an even slightly biased sampling method, the sample must be nearly as large as the population itself, a typically impractical requirement. The quality of the data matters much more than its size.
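To make the confidence-interval claim concrete, here is a sketch using the normal approximation for a proportion (the 1.96 multiplier gives a 95% interval): the interval from the biased sample of 60,000,000 is vanishingly narrow and misses the truth entirely.

```python
import math

total = 129085410
obama_true = 65915795 / total
obama_big = obama_true - 0.01  # the 1%-biased big-sample proportion
n = 60000000

# Normal-approximation 95% confidence interval around the biased estimate
se = math.sqrt(obama_big * (1 - obama_big) / n)
lo, hi = obama_big - 1.96 * se, obama_big + 1.96 * se

assert hi - lo < 0.001               # the interval is extremely narrow...
assert not (lo <= obama_true <= hi)  # ...and misses the true proportion
```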
Before accepting the results of a data analysis, it pays to carefully inspect the quality of the data. In particular, we must ask the following questions:
For the curious reader interested in a deeper comparison between random and large non-random samples, we suggest watching this lecture by the statistician Xiao-Li Meng.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/03'))
Tabular data, like the datasets we have worked with in Data 8, are one of the
most common and useful forms of data for analysis. We introduce tabular data
manipulation using pandas, the standard Python library for working with
tabular data. Although pandas's syntax is more challenging to use than the
datascience package used in Data 8, pandas provides significant performance
improvements and is the current tool of choice in both industry and academia
for working with tabular data.
It is more important that you understand the types of useful operations on data
than the exact details of pandas syntax. For example, knowing when to use a
group or a join is more useful than knowing how to call the pandas function
to group data. It is relatively easy to look up the function you need once you
know the right operation to use. All of the table manipulations in this chapter
will also appear again in a new syntax when we cover SQL, so it will help you
to understand them now.
Because we will cover only the most important pandas functions in this
textbook, you should bookmark the pandas documentation for reference
when you conduct your own data analyses.
In each section of this chapter we will work with the Baby Names dataset from Chapter 1. We will pose a question, break the question down into high-level steps, then translate each step into Python code using pandas DataFrames. We begin by importing pandas:
# pd is a common shorthand for pandas
import pandas as pd
Now we can read in the data using pd.read_csv (docs).
baby = pd.read_csv('babynames.csv')
baby
Note that for the code above to work, the babynames.csv file must be located in the same directory as this notebook. We can check what files are in the current folder by running ls in a notebook cell:
ls
When we use pandas to read in data, we get a DataFrame. A DataFrame is a tabular data structure where each column is labeled (in this case 'Name', 'Sex', 'Count', 'Year') and each row is labeled (in this case 0, 1, 2, ..., 1891893). The Table introduced in Data 8, however, only has labeled columns.
The labels of a DataFrame are called the indexes of the DataFrame and make many data manipulations easier.
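We can see both kinds of labels on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mary', 'Anna'], 'Count': [7065, 2604]})

# Row labels (the index) and column labels
assert list(df.index) == [0, 1]
assert list(df.columns) == ['Name', 'Count']
```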
Let's use pandas to answer the following question:
What were the five most popular baby names in 2016?
We can decompose this question into the following simpler table manipulations:
Now, we can express these steps in pandas.
To select subsets of a DataFrame, we use the .loc slicing syntax. The first argument is the label of the row and the second is the label of the column:
baby
baby.loc[1, 'Name'] # Row labeled 1, Column labeled 'Name'
To slice out multiple rows or columns, we can use :. Note that .loc slicing is inclusive, unlike Python's slicing.
# Get rows 1 through 5, columns Name through Count inclusive
baby.loc[1:5, 'Name':'Count']
We will often want a single column from a DataFrame:
baby.loc[:, 'Year']
Note that when we select a single column, we get a pandas Series. A Series is like a one-dimensional NumPy array since we can perform arithmetic on all the elements at once.
baby.loc[:, 'Year'] * 2
To select out specific columns, we can pass a list into the .loc slice:
# This is a DataFrame again
baby.loc[:, ['Name', 'Year']]
Selecting columns is common, so there's a shorthand.
# Shorthand for baby.loc[:, 'Name']
baby['Name']
# Shorthand for baby.loc[:, ['Name', 'Count']]
baby[['Name', 'Count']]
To slice out the rows with year 2016, we will first create a Series containing True for each row we want to keep and False for each row we want to drop. This is simple because math and boolean operators on Series are applied to each element in the Series.
# Series of years
baby['Year']
# Compare each year with 2016
baby['Year'] == 2016
Once we have this Series of True and False, we can pass it into .loc.
# We are slicing rows, so the boolean Series goes in the first
# argument to .loc
baby_2016 = baby.loc[baby['Year'] == 2016, :]
baby_2016
The next step is to sort the rows in descending order by 'Count'. We can use the sort_values() function.
sorted_2016 = baby_2016.sort_values('Count', ascending=False)
sorted_2016
Finally, we will use .iloc to slice out the first five rows of the DataFrame. .iloc works like .loc but takes in numerical indices instead of labels. It does not include the right endpoint in its slices, like Python's list slicing.
# Get the value in the zeroth row, zeroth column
sorted_2016.iloc[0, 0]
# Get the first five rows
sorted_2016.iloc[0:5]
We now have the five most popular baby names in 2016 and learned to express the following operations in pandas:
| Operation | pandas |
|---|---|
| Read a CSV file | pd.read_csv() |
| Slicing using labels or indices | .loc and .iloc |
| Slicing rows using a predicate | Use a boolean-valued Series in .loc |
| Sorting rows | .sort_values() |
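Chained together, the whole pipeline is only a few lines. Here is a sketch that uses a tiny made-up table in place of babynames.csv:

```python
import pandas as pd

# A small hypothetical stand-in for the Baby Names data
baby = pd.DataFrame({
    'Name': ['Emma', 'Olivia', 'Noah', 'Emma', 'Liam'],
    'Sex': ['F', 'F', 'M', 'F', 'M'],
    'Count': [19414, 19246, 19015, 20355, 18342],
    'Year': [2016, 2016, 2016, 2015, 2016],
})

top_2016 = (
    baby.loc[baby['Year'] == 2016, :]           # keep only 2016 rows
        .sort_values('Count', ascending=False)  # most common names first
        .iloc[0:5]                              # take the top five rows
)
assert top_2016['Name'].iloc[0] == 'Emma'
assert len(top_2016) == 4  # only four rows in this toy table are from 2016
```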
In this section, we will answer the question:
What were the most popular male and female names in each year?
Here's the Baby Names dataset once again:
baby = pd.read_csv('babynames.csv')
baby.head()
# the .head() method outputs the first five rows of the DataFrame
We should first notice that the question in the previous section has similarities to this one; the question in the previous section restricts names to babies born in 2016 whereas this question asks for names in all years.
We once again decompose this problem into simpler table manipulations.
Group the baby DataFrame by 'Year' and 'Sex', then find the most popular name within each group.

Recognizing which operation is needed for each problem is sometimes tricky. Usually, a convoluted series of steps will signal to you that there might be a simpler way to express what you want. If we didn't immediately recognize that we needed to group, for example, we might write steps like the following:
There is almost always a better alternative to looping over a pandas DataFrame. In particular, looping over unique values of a DataFrame should usually be replaced with a group.
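To see why, compare the two approaches on a tiny made-up table; the single groupby replaces the explicit loop over unique values:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2015, 2015, 2016], 'Count': [10, 20, 30]})

# Looping over unique values: verbose and slow on real data
loop_totals = {}
for year in df['Year'].unique():
    loop_totals[year] = df.loc[df['Year'] == year, 'Count'].sum()

# The groupby equivalent: one expression, same result
group_totals = df.groupby('Year')['Count'].sum()

assert loop_totals == group_totals.to_dict()
assert group_totals.to_dict() == {2015: 30, 2016: 30}
```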
To group in pandas, we use the .groupby() method.
baby.groupby('Year')
.groupby() returns a strange-looking DataFrameGroupBy object. We can call .agg() on this object with an aggregation function in order to get a familiar output:
# The aggregation function takes in a series of values for each group
# and outputs a single value
def length(series):
return len(series)
# Count up number of values for each year. This is equivalent to
# counting the number of rows where each year appears.
baby.groupby('Year').agg(length)
You might notice that the length function simply calls the len function, so we can simplify the code above.
baby.groupby('Year').agg(len)
The aggregation is applied to each column of the DataFrame, producing redundant information. We can restrict the output columns by slicing before grouping.
year_rows = baby[['Year', 'Count']].groupby('Year').agg(len)
year_rows
# A further shorthand to accomplish the same result:
#
# year_counts = baby[['Year', 'Count']].groupby('Year').count()
#
# pandas has shorthands for common aggregation functions, including
# count, sum, and mean.
Note that the index of the resulting DataFrame now contains the unique years, so we can slice subsets of years using .loc as before:
# Every twentieth year starting at 1880
year_rows.loc[1880:2016:20, :]
As we've seen in Data 8, we can group on multiple columns to get groups based on unique pairs of values. To do this, pass in a list of column labels into .groupby().
grouped_counts = baby.groupby(['Year', 'Sex']).sum()
grouped_counts
The code above computes the total number of babies born for each year and sex. Let's now use grouping by multiple columns to compute the most popular names for each year and sex. Since the data are already sorted in descending order of Count for each year and sex, we can define an aggregation function that returns the first value in each series. (If the data weren't sorted, we could call sort_values() first.)
# The most popular name is simply the first one that appears in the series
def most_popular(series):
return series.iloc[0]
baby_pop = baby.groupby(['Year', 'Sex']).agg(most_popular)
baby_pop
Notice that grouping by multiple columns results in multiple labels for each row. This is called a "multilevel index" and is tricky to work with. The important thing to know is that .loc takes in a tuple for the row index instead of a single value:
baby_pop.loc[(2000, 'F'), 'Name']
But .iloc behaves the same as usual since it uses indices instead of labels:
baby_pop.iloc[10:15, :]
If you group by two columns, you can often use pivot to present your data in a more convenient format. Using a pivot lets you use one set of grouped labels as the columns of the resulting table.
To pivot, use the pd.pivot_table() function.
pd.pivot_table(baby,
index='Year', # Index for rows
columns='Sex', # Columns
values='Name', # Values in table
aggfunc=most_popular) # Aggregation function
Compare this result to the baby_pop table that we computed using .groupby(). We can see that the Sex index in baby_pop became the columns of the pivot table.
baby_pop
We now have the most popular baby names for each sex and year in our dataset and learned to express the following operations in pandas:
| Operation | pandas |
|---|---|
| Group | df.groupby(label) |
| Group by multiple columns | df.groupby([label1, label2]) |
| Group and aggregate | df.groupby(label).agg(func) |
| Pivot | pd.pivot_table() |
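The relationship between the two tables in the summary can be seen on a small made-up example: pivoting moves one level of the grouped index into the columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2015, 2015, 2016, 2016],
    'Sex': ['F', 'M', 'F', 'M'],
    'Name': ['Emma', 'Noah', 'Emma', 'Liam'],
})

# Grouping by two columns produces a multilevel row index...
grouped = df.groupby(['Year', 'Sex']).agg(lambda s: s.iloc[0])

# ...while pivoting puts one of those levels into the columns
pivoted = pd.pivot_table(df, index='Year', columns='Sex',
                         values='Name', aggfunc=lambda s: s.iloc[0])

assert grouped.loc[(2016, 'M'), 'Name'] == 'Liam'
assert pivoted.loc[2016, 'M'] == 'Liam'
```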
In this section, we will answer the question:
Can we use the last letter of a name to predict the sex of the baby?
Here's the Baby Names dataset once again:
baby = pd.read_csv('babynames.csv')
baby.head()
# the .head() method outputs the first five rows of the DataFrame
Although there are many ways to see whether prediction is possible, we will use plotting in this section. We can decompose this question into two steps:
pandas Series contain an .apply() method that takes in a function and applies it to each value in the Series.
names = baby['Name']
names.apply(len)
To extract the last letter of each name, we can define our own function to pass into .apply():
def last_letter(string):
return string[-1]
names.apply(last_letter)
Although .apply() is flexible, it is often faster to use the built-in string manipulation functions in pandas when dealing with text data.
pandas provides access to string manipulation functions using the .str attribute of Series.
names = baby['Name']
names.str.len()
We can directly slice out the last letter of each name in a similar way.
names.str[-1]
We suggest looking at the docs for the full list of string methods (link).
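On a small made-up Series, the .apply() version and the .str version agree exactly:

```python
import pandas as pd

names = pd.Series(['Mary', 'Anna', 'Emma'])

def last_letter(string):
    return string[-1]

# Our own function via .apply() and the vectorized .str slice agree
assert names.apply(last_letter).equals(names.str[-1])
assert list(names.str[-1]) == ['y', 'a', 'a']
```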
We can now add this column of last letters to our baby DataFrame.
baby['Last'] = names.str[-1]
baby
To compute the sex distribution for each last letter, we need to group by both Last and Sex.
# Shorthand for baby.groupby(['Last', 'Sex']).agg(np.sum)
baby.groupby(['Last', 'Sex']).sum()
Notice that Year is also summed up since each non-grouped column is passed into the aggregation function. To avoid this, we can select out the desired columns before calling .groupby().
# When lines get long, you can wrap the entire expression in parentheses
# and insert newlines before each method call
letter_dist = (
baby[['Last', 'Sex', 'Count']]
.groupby(['Last', 'Sex'])
.sum()
)
letter_dist
pandas provides built-in plotting functionality for most basic plots, including bar charts, histograms, line charts, and scatterplots. To make a plot from a DataFrame, use the .plot attribute:
# We use the figsize option to make the plot larger
letter_dist.plot.barh(figsize=(10, 10))
Although this plot shows the distribution of letters and sexes, the male and female bars are difficult to tell apart. By looking at the pandas docs on plotting (link) we learn that pandas plots one group of bars for each row in the DataFrame, showing one differently colored bar for each column. This means that a pivoted version of the letter_dist table will have the right format.
letter_pivot = pd.pivot_table(
baby, index='Last', columns='Sex', values='Count', aggfunc='sum'
)
letter_pivot
letter_pivot.plot.barh(figsize=(10, 10))
Notice that pandas conveniently generates a legend for us as well. However, this is still difficult to interpret. We are plotting the counts for each letter and sex, which causes some bars to appear very long and others to be almost invisible. We should instead plot the proportion of male and female babies within each last letter.
total_for_each_letter = letter_pivot['F'] + letter_pivot['M']
letter_pivot['F prop'] = letter_pivot['F'] / total_for_each_letter
letter_pivot['M prop'] = letter_pivot['M'] / total_for_each_letter
letter_pivot
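An equivalent, more concise way to normalize each row (a sketch on a made-up pivot, not the book's code) uses .div() with the row totals:

```python
import pandas as pd

# A small hypothetical pivot of counts by last letter and sex
letter_pivot = pd.DataFrame(
    {'F': [300, 50], 'M': [100, 450]}, index=['a', 'p']
)

# Divide each row by its total to turn counts into proportions
props = letter_pivot.div(letter_pivot.sum(axis=1), axis=0)

assert props.loc['a', 'F'] == 0.75  # 300 / 400
assert ((props.sum(axis=1) - 1).abs() < 1e-9).all()
```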
(letter_pivot[['F prop', 'M prop']]
.sort_values('M prop') # Sorting orders the plotted bars
.plot.barh(figsize=(10, 10))
)
We can see that almost all first names that end in 'p' are male and names that end in 'a' are female! In general, the difference between bar lengths for many letters implies that we can often make a good guess at a person's sex if we just know the last letter of their first name.
We've learned to express the following operations in pandas:
| Operation | pandas |
|---|---|
| Applying a function elementwise | series.apply(func) |
| String manipulation | series.str.func() |
| Plotting | df.plot.func() |
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/04'))
Data come in many formats and vary greatly in usefulness for analysis. Although we would prefer all our data to come in a tabular format with each value recorded consistently and accurately, in reality we must carefully check our data for potential issues that can eventually result in incorrect conclusions.
The term "data cleaning" refers to the process of combing through the data and deciding how to resolve inconsistencies and missing values. We will discuss common problems found in datasets and approaches to address them.
Data cleaning has inherent limitations. For example, no amount of data cleaning will fix a biased sampling process. Before embarking on the sometimes lengthy process of data cleaning, we must be confident that our data are collected accurately and with as little bias as possible. Only then can we investigate the data itself and use data cleaning to resolve issues in the data format or entry process.
We will introduce data cleaning techniques by working with City of Berkeley Police Department datasets.
We will use the Berkeley Police Department's publicly available datasets to demonstrate data cleaning techniques. We have downloaded the Calls for Service dataset and Stops dataset.
We can use the ls shell command with the -lh flags to see more details about the files:
!ls -lh data/
The command above shows the data files and their file sizes. This is especially useful because we now know the files are small enough to load into memory. As a rule of thumb, it is usually safe to load into memory a file that is around one fourth of the total memory capacity of the computer. For example, if a computer has 4GB of RAM we should be able to load a 1GB CSV file in pandas. To handle larger datasets we will need additional computational tools that we will cover later in this book.
Notice the use of the exclamation point before ls. This tells Jupyter that the next line of code is a shell command, not a Python expression. We can run any available shell command in Jupyter using !:
# The `wc` shell command shows us how many lines each file has.
# We can see that the `stops.json` file has the most lines (29852).
!wc -l data/*
We will state important questions you should ask of all datasets before data cleaning or processing. These questions are related to how the data were generated, so data cleaning will usually not be able to resolve issues that arise here.
What do the data contain? The website for the Calls for Service data states that the dataset describes "crime incidents (not criminal reports) within the last 180 days". Further reading reveals that "not all calls for police service are included (e.g. Animal Bite)".
The website for the Stops data states that the dataset contains data on all "vehicle detentions (including bicycles) and pedestrian detentions (up to five persons)" since January 26, 2015.
Are the data a census? This depends on our population of interest. For example, if we are interested in calls for service within the last 180 days for crime incidents then the Calls dataset is a census. However, if we are interested in calls for service within the last 10 years the dataset is clearly not a census. We can make similar statements about the Stops dataset since the data collection started on January 26, 2015.
If the data form a sample, is it a probability sample? If we are investigating a period of time that the data do not have entries for, the data do not form a probability sample since there is no randomness involved in the data collection process — we have all data for certain time periods but no data for others.
What limitations will this data have on our conclusions? Although we will ask this question at each step of our data processing, we can already see that our data impose important limitations. The most important limitation is that we cannot make unbiased estimations for time periods not covered by our datasets.
Let's now clean the Calls dataset. The head shell command prints the first few lines of a file.
!head data/Berkeley_PD_-_Calls_for_Service.csv
It appears to be a comma-separated values (CSV) file, though it's hard to tell whether the entire file is formatted properly. We can use pd.read_csv to read in the file as a DataFrame. If pd.read_csv errors, we will have to dig deeper and manually resolve formatting issues. Fortunately, pd.read_csv successfully returns a DataFrame:
calls = pd.read_csv('data/Berkeley_PD_-_Calls_for_Service.csv')
calls
We can define a function to show different slices of the data and then interact with it:
def df_interact(df):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + 5, col:col + 6]
interact(peek, row=(0, len(df), 5), col=(0, len(df.columns) - 6))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
df_interact(calls)
Based on the output above, the resulting DataFrame looks reasonably well-formed since the columns are properly named and the data in each column seems to be entered consistently. What data does each column contain? We can look at the dataset website:
| Column | Description | Type |
|---|---|---|
| CASENO | Case Number | Number |
| OFFENSE | Offense Type | Plain Text |
| EVENTDT | Date Event Occurred | Date & Time |
| EVENTTM | Time Event Occurred | Plain Text |
| CVLEGEND | Description of Event | Plain Text |
| CVDOW | Day of Week Event Occurred | Number |
| InDbDate | Date dataset was updated in the portal | Date & Time |
| Block_Location | Block level address of event | Location |
| BLKADDR | | Plain Text |
| City | | Plain Text |
| State | | Plain Text |
On the surface the data looks easy to work with. However, before starting data analysis we must answer the following questions:
Although there are plenty more checks to go through, these three will suffice for many cases. See the Quartz bad data guide for a more complete list of checks.
This is a simple check in pandas:
# True if row contains at least one null value
null_rows = calls.isnull().any(axis=1)
calls[null_rows]
It looks like 27 calls didn't have a recorded address in BLKADDR. Unfortunately, the data description isn't very clear on how the locations were recorded. We know that all of these calls were made for events in Berkeley, so we can likely assume that the addresses for these calls were originally somewhere in Berkeley.
From the missing value check above we can see that the Block_Location column has Berkeley, CA recorded if the location was missing.
In addition, an inspection of the calls table shows us that the EVENTDT column has the correct dates but records 12am for all of its times. Instead, the times are in the EVENTTM column.
# Show the first 7 rows of the table again for reference
calls.head(7)
As a data cleaning step, we want to merge the EVENTDT and EVENTTM columns to record both date and time in one field. If we define a function that takes in a DF and returns a new DF, we can later use DataFrame.pipe to apply all our transformations in one go.
def combine_event_datetimes(calls):
    combined = pd.to_datetime(
        # Combine date and time strings
        calls['EVENTDT'].str[:10] + ' ' + calls['EVENTTM'],
        infer_datetime_format=True,
    )
    return calls.assign(EVENTDTTM=combined)
# To peek at the result without mutating the calls DF:
calls.pipe(combine_event_datetimes).head(2)
It looks like most of the data columns are machine-recorded, including the date, time, day of week, and location of the event.
In addition, the OFFENSE and CVLEGEND columns appear to contain consistent values. We can check the unique values in each column to see if anything was misspelled:
calls['OFFENSE'].unique()
calls['CVLEGEND'].unique()
Since each value in these columns appears to be spelled correctly, we won't have to perform any corrections on these columns.
We also check the BLKADDR column for inconsistencies and find that sometimes an address is recorded (e.g. 2500 LE CONTE AVE) but other times a cross street is recorded (e.g. ALLSTON WAY & FIFTH ST). This suggests that a human entered this data in and this column will be difficult to use for analysis. Fortunately we can use the latitude and longitude of the event instead of the street address.
calls['BLKADDR'][[0, 5001]]
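One hypothetical way to gauge how mixed this column is: cross streets are written with an ampersand between the street names, so we can count entries containing an `&`. The sample values below are illustrative stand-ins for BLKADDR entries:

```python
import pandas as pd

# Hypothetical sample of BLKADDR-style values (including a missing one)
blkaddr = pd.Series(["2500 LE CONTE AVE", "ALLSTON WAY & FIFTH ST", None])

# Cross streets are recorded with an ampersand between street names;
# na=False treats missing entries as non-matches
is_cross_street = blkaddr.str.contains('&', na=False)
print(is_cross_street.tolist())
```

A count like `is_cross_street.sum()` would then tell us what fraction of addresses use the cross-street convention.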
This dataset seems almost ready for analysis. The Block_Location column seems to contain strings that record address, latitude, and longitude. We will want to separate the latitude and longitude for easier use.
def split_lat_lon(calls):
    return calls.join(
        calls['Block_Location']
        # Get coords from string
        .str.split('\n').str[2]
        # Remove parens from coords
        .str[1:-1]
        # Split latitude and longitude
        .str.split(', ', expand=True)
        .rename(columns={0: 'Latitude', 1: 'Longitude'})
    )
calls.pipe(split_lat_lon).head(2)
Then, we can match the day of week number with its weekday:
# This DF contains the day for each number in CVDOW
day_of_week = pd.read_csv('data/cvdow.csv')
day_of_week
def match_weekday(calls):
    return calls.merge(day_of_week, on='CVDOW')
calls.pipe(match_weekday).head(2)
We'll drop columns we no longer need:
def drop_unneeded_cols(calls):
    return calls.drop(columns=['CVDOW', 'InDbDate', 'Block_Location', 'City',
                               'State', 'EVENTDT', 'EVENTTM'])
Finally, we'll pipe the calls DF through all the functions we've defined:
calls_final = (calls.pipe(combine_event_datetimes)
.pipe(split_lat_lon)
.pipe(match_weekday)
.pipe(drop_unneeded_cols))
df_interact(calls_final)
The Calls dataset is now ready for further data analysis. In the next section, we will clean the Stops dataset.
# HIDDEN
# Save data to CSV for other chapters
# calls_final.to_csv('../ch5/data/calls.csv', index=False)
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/04'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0):
        return df[row:row + 5]
    interact(peek, row=(0, len(df), 5))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
!head data/stops.json
The stops.json file is clearly not a CSV file. In this case, the file contains data in the JSON (JavaScript Object Notation) format, a commonly used data format where data is recorded in a dictionary format. Python's json module makes reading in this file as a dictionary simple.
import json
# Note that this could cause our computer to run out of memory if the file
# is large. In this case, we've verified that the file is small enough to
# read in beforehand.
with open('data/stops.json') as f:
    stops_dict = json.load(f)
stops_dict.keys()
Note that stops_dict is a Python dictionary, so displaying it will display the entire dataset in the notebook. This could cause the browser to crash, so we only display the keys of the dictionary above. To peek at the data without potentially crashing the browser, we can print the dictionary to a string and only output some of the first characters of the string.
from pprint import pformat
def print_dict(dictionary, num_chars=1000):
    print(pformat(dictionary)[:num_chars])
print_dict(stops_dict['meta'])
print_dict(stops_dict['data'], num_chars=300)
We can likely deduce that the 'meta' key in the dictionary contains a description of the data and its columns, and that the 'data' key contains a list of data rows. We can use this information to initialize a DataFrame.
# Load the data from JSON and assign column titles
stops = pd.DataFrame(
stops_dict['data'],
columns=[c['name'] for c in stops_dict['meta']['view']['columns']])
stops
# Prints column names
stops.columns
The website contains documentation about the following columns:
| Column | Description | Type |
|---|---|---|
| Incident Number | Number of incident created by Computer Aided Dispatch (CAD) program | Plain Text |
| Call Date/Time | Date and time of the incident/stop | Date & Time |
| Location | General location of the incident/stop | Plain Text |
| Incident Type | This is the occurred incident type created in the CAD program. A code signifies a traffic stop (T), suspicious vehicle stop (1196), pedestrian stop (1194) and bicycle stop (1194B). | Plain Text |
| Dispositions | Ordered in the following sequence:<br>1st Character = Race: A (Asian), B (Black), H (Hispanic), O (Other), W (White)<br>2nd Character = Gender: F (Female), M (Male)<br>3rd Character = Age Range: 1 (Less than 18), 2 (18-29), 3 (30-39), 4 (Greater than 40)<br>4th Character = Reason: I (Investigation), T (Traffic), R (Reasonable Suspicion), K (Probation/Parole), W (Wanted)<br>5th Character = Enforcement: A (Arrest), C (Citation), O (Other), W (Warning)<br>6th Character = Car Search: S (Search), N (No Search)<br>Additional dispositions may also appear: P (Primary case report), M (MDT narrative only), AR (Arrest report only, no case report submitted), IN (Incident report), FC (Field Card), CO (Collision investigation report), MH (Emergency Psychiatric Evaluation), TOW (Impounded vehicle), 0 or 00000 (Officer made a stop of more than five persons) | Plain Text |
| Location - Latitude | General latitude of the call. This data is only uploaded after January 2017 | Number |
| Location - Longitude | General longitude of the call. This data is only uploaded after January 2017. | Number |
Notice that the website doesn't contain descriptions for the first 8 columns of the stops table. Since these columns appear to contain metadata that we're not interested in analyzing this time, we drop them from the table.
columns_to_drop = ['sid', 'id', 'position', 'created_at', 'created_meta',
                   'updated_at', 'updated_meta', 'meta']
# This function takes in a DF and returns a DF so we can use it for .pipe
def drop_unneeded_cols(stops):
    return stops.drop(columns=columns_to_drop)
stops.pipe(drop_unneeded_cols)
As with the Calls dataset, we will answer the following three questions about the Stops dataset:

1. Are there missing values in the dataset?
2. Are there any missing values that were filled in?
3. Which parts of the data were entered by a human?

We can clearly see that there are many missing latitudes and longitudes. The data description states that these two columns are only filled in after Jan 2017.
# True if row contains at least one null value
null_rows = stops.isnull().any(axis=1)
stops[null_rows]
We can check the other columns for missing values:
# True if row contains at least one null value without checking
# the latitude and longitude columns
null_rows = stops.iloc[:, :-2].isnull().any(axis=1)
df_interact(stops[null_rows])
By browsing through the table above, we can see that all other missing values are in the Dispositions column. Unfortunately, we do not know from the data description why these dispositions might be missing. Since there are only 63 missing values compared to the roughly 25,000 rows in the original table, we can proceed with analysis while being mindful that these missing values could impact results.
It doesn't seem like any previously missing values were filled in for us. Unlike in the Calls dataset where the date and time were in separate columns, the Call Date/Time column in the Stops dataset contains both date and time.
As with the Calls dataset, it looks like most of the columns in this dataset were recorded by a machine or were a category selected by a human (e.g. Incident Type).
However, the Location column doesn't have consistently entered values. Sure enough, we spot some typos in the data:
stops['Location'].value_counts()
What a mess! It looks like sometimes an address was entered, sometimes a cross-street, and other times a latitude-longitude pair. Unfortunately, we don't have very complete latitude-longitude data to use in place of this column. We may have to manually clean this column if we want to use locations for future analysis.
We can also check the Dispositions column:
dispositions = stops['Dispositions'].value_counts()
# Outputs a slider to pan through the unique Dispositions in
# order of how often they appear
interact(lambda row=0: dispositions.iloc[row:row+7],
row=(0, len(dispositions), 7))
The Dispositions column also contains inconsistencies. For example, some dispositions start with a space, some end with a semicolon, and some contain multiple entries. The variety of values suggests that this field contains human-entered values and should be treated with caution.
# Strange values...
dispositions.iloc[[0, 20, 30, 266, 1027]]
In addition, the most common disposition is M, which isn't a permitted first character in the Dispositions column. This could mean that the format of the column changed over time or that officers are allowed to enter the disposition without matching the format in the data description. In either case, the column will be challenging to work with.
We can take some simple steps to clean the Dispositions column by removing leading and trailing whitespace, removing trailing semi-colons, and replacing the remaining semi-colons with commas.
def clean_dispositions(stops):
    cleaned = (stops['Dispositions']
               .str.strip()
               .str.rstrip(';')
               .str.replace(';', ','))
    return stops.assign(Dispositions=cleaned)
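To see what these string methods do step by step, here is a small hypothetical sample of disposition-style strings (illustrative values, not drawn from the real data):

```python
import pandas as pd

# Hypothetical raw dispositions showing the three inconsistencies:
# a leading space, trailing semicolons, and multiple entries
raw = pd.Series([" M; AR;", "BM2TWN;", "HM4TCS"])

cleaned = (raw
           .str.strip()                          # remove surrounding whitespace
           .str.rstrip(';')                      # remove trailing semicolons
           .str.replace(';', ',', regex=False))  # separate entries with commas
print(cleaned.tolist())
```

Note that `regex=False` makes the replacement a plain string substitution, which is what we want for a literal semicolon.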
As before, we can now pipe the stops DF through the cleaning functions we've defined:
stops_final = (stops
.pipe(drop_unneeded_cols)
.pipe(clean_dispositions))
df_interact(stops_final)
As these two datasets have shown, data cleaning can often be both difficult and tedious. Cleaning 100% of the data often takes too long, but not cleaning the data at all results in faulty conclusions; we have to weigh our options and strike a balance each time we encounter a new dataset.
The decisions made during data cleaning impact all future analyses. For example, we chose not to clean the Location column of the Stops dataset so we should treat that column with caution. Each decision made during data cleaning should be carefully documented for future reference, preferably in a notebook so that both code and explanations appear together.
# HIDDEN
# Save data to CSV for other chapters
# stops_final.to_csv('../ch5/data/stops.csv', index=False)
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/05'))
Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.
― John Tukey
In Exploratory Data Analysis (EDA), the third step of the data science lifecycle, we summarize, visualize, and transform the data in order to understand it more deeply. In particular, through EDA we identify potential issues in the data and discover trends that inform further analyses.
We seek to understand the following properties of our data:

- Structure: the format and shape of the data files
- Granularity: what each record in the data represents
- Scope: the coverage of the data relative to our questions of interest
- Temporality: how the data are situated in time
- Faithfulness: how well the data capture reality
Although we introduce data cleaning and EDA separately to help organize this book, in practice you will often switch between the two. For example, visualizing a column may show misformatted values that you should use data cleaning techniques to process. With this in mind, we return to the Berkeley Police Department datasets for exploration.
# HIDDEN
from IPython.display import display, HTML
def display_two(df1, df2):
    '''Displays two DFs side-by-side.'''
    display(
        HTML('<div style="display: flex;">'
             '{}'
             '<div style="width: 20px;"></div>'
             '{}'
             '</div>'.format(df1._repr_html_(), df2._repr_html_()))
    )
The structure of a dataset refers to the "shape" of the data files. At a basic level, this refers to the format that the data are entered in. For example, we saw that the Calls dataset is a comma-separated values file:
!head data/Berkeley_PD_-_Calls_for_Service.csv
The Stops dataset, on the other hand, is a JSON (JavaScript Object Notation) file.
# Show first and last 5 lines of file
!head -n 5 data/stops.json
!echo '...'
!tail -n 5 data/stops.json
Of course, there are many other types of data formats. Here is a list of the most common ones:

- Comma-Separated Values (CSV) and Tab-Separated Values (TSV). These files contain tabular data delimited by a comma for CSV or a tab character (\t) for TSV. They are typically easy to work with because the data are entered in a similar format to DataFrames.
- JavaScript Object Notation (JSON). These files contain data in a nested dictionary format, as we saw with the Stops dataset.
- eXtensible Markup Language (XML) or HyperText Markup Language (HTML). These files also contain data in a nested format, for example:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
In a later chapter we will use XPath to extract data from these types of files.
Log data. Many applications will output some data as they run in an unstructured text format, for example:
2005-03-23 23:47:11,663 - sa - INFO - creating an instance of aux_module.Aux
2005-03-23 23:47:11,665 - sa.aux.Aux - INFO - creating an instance of Aux
2005-03-23 23:47:11,665 - sa - INFO - created an instance of aux_module.Aux
2005-03-23 23:47:11,668 - sa - INFO - calling aux_module.Aux.do_something
2005-03-23 23:47:11,668 - sa.aux.Aux - INFO - doing something
In a later chapter we will use Regular Expressions to extract data from these types of files.
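As a sketch of how nested formats are parsed, here is a tiny JSON document read with Python's standard json module. The contents are hypothetical, loosely modeled on the structure of stops.json:

```python
import json

# Hypothetical nested JSON, similar in spirit to stops.json
raw = '{"meta": {"view": {"name": "Stops"}}, "data": [[1, "a"], [2, "b"]]}'
parsed = json.loads(raw)

# Dictionary access mirrors the nesting of the file
name = parsed["meta"]["view"]["name"]
num_rows = len(parsed["data"])
print(name, num_rows)
```

Once parsed, the nested dictionary can be flattened into a tabular form, as we did when building the stops DataFrame.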
Data will often be split across multiple tables. For example, one table can describe some people's personal information while another will contain their emails:
people = pd.DataFrame(
[["Joey", "blue", 42, "M"],
["Weiwei", "blue", 50, "F"],
["Joey", "green", 8, "M"],
["Karina", "green", 7, "F"],
["Nhi", "blue", 3, "F"],
["Sam", "pink", -42, "M"]],
columns = ["Name", "Color", "Number", "Sex"])
people
email = pd.DataFrame(
[["Deb", "deborah_nolan@berkeley.edu"],
["Sam", "samlau95@berkeley.edu"],
["John", "doe@nope.com"],
["Joey", "jegonzal@cs.berkeley.edu"],
["Weiwei", "weiwzhang@berkeley.edu"],
["Weiwei", "weiwzhang+123@berkeley.edu"],
["Karina", "kgoot@berkeley.edu"]],
columns = ["User Name", "Email"])
email
To match up each person with his or her email, we can join the two tables on the columns that contain the usernames. We must then decide what to do about people that appear in one table but not the other. For example, Nhi appears in the people table but not the email table, while Deb and John appear only in the email table. We have several types of joins, one for each strategy of handling these unmatched rows. One of the more common joins is the inner join, where any row that doesn't have a match is dropped from the final result:
# Nhi, Deb, and John don't appear
people.merge(email, how='inner', left_on='Name', right_on='User Name')
There are four basic joins that we use most often: inner, full (sometimes called "outer"), left, and right joins. Below is a diagram to show the difference between these types of joins.

Use the dropdown menu below to show the result of the four different types of joins on the people and email tables. Notice which rows contain NaN values for outer, left, and right joins.
# HIDDEN
def join_demo(join_type):
    display(HTML('people and email tables:'))
    display_two(people, email)
    display(HTML('<br>'))
    display(HTML('Joined table:'))
    display(people.merge(email, how=join_type,
                         left_on='Name', right_on='User Name'))
interact(join_demo, join_type=['inner', 'outer', 'left', 'right']);
You should have answers to the following questions after looking at the structure of your datasets. We will answer them for the Calls and Stops datasets.
Are the data in a standard format or encoding?
Standard formats include:

- Tabular data: CSV, TSV, Excel, SQL
- Nested data: JSON, XML, HTML
The Calls dataset came in the CSV format while the Stops dataset came in the JSON format.
Are the data organized in records (e.g. rows)? If not, can we define records by parsing the data?
The Calls dataset came in rows; we extracted records from the Stops dataset.
Are the data nested? If so, can we reasonably unnest the data?
The Calls dataset wasn't nested; we didn't have to work too hard to unnest data from the Stops dataset.
Do the data reference other data? If so, can we join the data?
The Calls dataset references the day of week table. Joining those two tables gives us the day of week for each incident in the dataset. The Stops dataset had no obvious references.
What are the fields (e.g. columns) in each record? What is the type of each column?
The fields for the Calls and Stops datasets are described in the Data Cleaning sections for each dataset.
The granularity of your data is what each record in your data represents. For example, in the Calls dataset each record represents a single case of a police call.
# HIDDEN
calls = pd.read_csv('data/calls.csv')
calls.head()
In the Stops dataset, each record represents a single incident of a police stop.
# HIDDEN
stops = pd.read_csv('data/stops.csv', parse_dates=[1], infer_datetime_format=True)
stops.head()
On the other hand, we could have received the Stops data in the following format:
# HIDDEN
(stops
.groupby(stops['Call Date/Time'].dt.date)
.size()
.rename('Num Incidents')
.to_frame()
)
In this case, each record in the table corresponds to a single date instead of a single incident. We would describe this table as having a coarser granularity than the one above. It's important to know the granularity of your data because it determines what kind of analyses you can perform. Generally speaking, too fine of a granularity is better than too coarse; while we can use grouping and pivoting to change a fine granularity to a coarse one, we have few tools to go from coarse to fine.
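The fine-to-coarse direction can be sketched with a groupby on a hypothetical incident log (illustrative values only):

```python
import pandas as pd

# Hypothetical fine-grained table: one row per incident
incidents = pd.DataFrame({
    "date": ["2017-08-01", "2017-08-01", "2017-08-02"],
    "offense": ["THEFT", "BURGLARY", "THEFT"],
})

# Coarsen the granularity to one row per date,
# like the aggregated Stops table shown above
per_day = incidents.groupby("date").size().rename("Num Incidents")
print(per_day.to_dict())
```

Going the other way, from daily counts back to individual incidents, is impossible without additional data, which is why finer granularity is generally preferable.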
You should have answers to the following questions after looking at the granularity of your datasets. We will answer them for the Calls and Stops datasets.
What does a record represent?
In the Calls dataset, each record represents a single case of a police call. In the Stops dataset, each record represents a single incident of a police stop.
Do all records capture granularity at the same level? (Sometimes a table will contain summary rows.)
Yes, for both Calls and Stops datasets.
If the data were aggregated, how was the aggregation performed? Sampling and averaging are common aggregations.
As far as we can tell, no aggregations were performed on these datasets. We do keep in mind that in both datasets, the location is entered as a block location instead of a specific address.
What kinds of aggregations can we perform on the data?
For example, it's often useful to aggregate individual people to demographic groups or individual events to totals across time.
In this case, we can aggregate across various granularities of date or time. For example, we can find the most common hour of day for incidents with aggregation. We might also be able to aggregate across event locations to find the regions of Berkeley with the most incidents.
# HIDDEN
calls = pd.read_csv('data/calls.csv', parse_dates=['EVENTDTTM'], infer_datetime_format=True)
stops = pd.read_csv('data/stops.csv', parse_dates=[1], infer_datetime_format=True)
The scope of the dataset refers to the coverage of the dataset in relation to what we are interested in analyzing. We seek to answer the following question about our data scope:
Does the data cover the topic of interest?
For example, the Calls and Stops datasets contain call and stop incidents made in Berkeley. If we are interested in crime incidents in the state of California, however, these datasets will be too limited in scope.
In general, larger scope is more useful than smaller scope since we can filter larger scope down to a smaller scope but often can't go from smaller scope to larger scope. For example, if we had a dataset of police stops in the United States we could subset the dataset to investigate Berkeley.
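Narrowing a larger scope down is a simple filter. A sketch with a hypothetical nationwide stops table (invented values):

```python
import pandas as pd

# Hypothetical nationwide stops table
us_stops = pd.DataFrame({
    "City": ["Berkeley", "Oakland", "Berkeley"],
    "State": ["CA", "CA", "CA"],
})

# Restrict the scope to Berkeley only
berkeley_stops = us_stops[us_stops["City"] == "Berkeley"]
print(len(berkeley_stops))
```

The reverse, expanding a Berkeley-only dataset to cover all of California, would require collecting entirely new data.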
Keep in mind that scope is a broad term and is not always about geographic location. For example, it can also refer to time coverage: the Calls dataset only contains data for a 180-day period.
We will often address the scope of the dataset during the investigation of the data generation process and confirm the dataset's scope during EDA. Let's confirm the geographic and time scope of the Calls dataset.
calls
# Shows earliest and latest dates in calls
calls['EVENTDTTM'].dt.date.sort_values()
calls['EVENTDTTM'].dt.date.max() - calls['EVENTDTTM'].dt.date.min()
The table contains data for a time period of 179 days, which is close enough to the 180-day period in the data description that we can suppose there were no calls on either April 14th, 2017 or August 29th, 2017.
To check the geographic scope, we can use a map:
import folium # Use the Folium Javascript Map Library
import folium.plugins
SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
# .as_matrix() was removed in newer pandas versions; use .to_numpy()
locs = calls[['Latitude', 'Longitude']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
sf_map.add_child(heatmap)
With a few exceptions, the Calls dataset covers the Berkeley area. We can see that most police calls happened in the Downtown Berkeley and south of UC Berkeley campus areas.
Let's now confirm the temporal and geographic scope for the Stops dataset:
stops
stops['Call Date/Time'].dt.date.sort_values()
As promised, the data collection begins on January 26th, 2015. It looks like the data were downloaded somewhere around the beginning of May 2017 since the dates stop on April 30th, 2017. Let's draw a map to see the geographic data:
SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
# .as_matrix() was removed in newer pandas versions; use .to_numpy()
locs = stops[['Location - Latitude', 'Location - Longitude']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
sf_map.add_child(heatmap)
We can confirm that the police stops in the dataset happened in Berkeley, and that most police calls happened in the Downtown Berkeley and West Berkeley areas.
Temporality refers to how the data are situated in time and specifically to the date and time fields in the dataset. We seek to understand the following traits about these fields:
What is the meaning of the date and time fields in the dataset?
In the Calls and Stops dataset, the datetime fields represent when the call or stop was made by the police. However, the Stops dataset also originally had a datetime field recording when the case was entered into the database which we took out during data cleaning since we didn't think it would be useful for analysis.
In addition, we should be careful to note the timezone and Daylight Savings for datetime fields especially when dealing with data that comes from multiple locations.
What representation do the date and time fields have in the data?
Although the US uses the MM/DD/YYYY format, many other countries use the DD/MM/YYYY format. There are still more formats in use around the world and it's important to recognize these differences when analyzing data.
In the Calls and Stops dataset, the dates came in the MM/DD/YYYY format.
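The ambiguity matters when parsing: the same string yields different dates under the two conventions. A sketch using pd.to_datetime with explicit formats:

```python
import pandas as pd

ambiguous = "03/04/2017"

# US convention: month first -> March 4th
us_date = pd.to_datetime(ambiguous, format="%m/%d/%Y")
# International convention: day first -> April 3rd
intl_date = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(us_date.month, intl_date.month)
```

Passing an explicit format (or the `dayfirst` argument) avoids silently parsing dates under the wrong convention.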
Are there strange timestamps that might represent null values?
Some programs use placeholder datetimes instead of null values. For example, Excel's default date is Jan 1st, 1900, and on Excel for Mac it's Jan 1st, 1904. Many applications will generate a default datetime of 12:00am Jan 1st, 1970 or 11:59pm Dec 31st, 1969 since these correspond to the Unix epoch for timestamps. If you notice multiple instances of these timestamps in your data, proceed with caution and double-check your data sources. Neither the Calls nor the Stops dataset contains any of these suspicious values.
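A quick check for the Unix epoch placeholder might look like this (the timestamp values here are hypothetical):

```python
import pandas as pd

# Hypothetical datetime column containing one suspicious placeholder
times = pd.to_datetime(pd.Series([
    "2017-04-14 08:00:00",
    "1970-01-01 00:00:00",  # Unix epoch: likely a placeholder, not a real time
]))

# Count entries equal to the epoch
suspicious = times == pd.Timestamp("1970-01-01")
print(int(suspicious.sum()))
```

A nonzero count here would prompt us to go back to the data source and ask how those timestamps were generated.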
We describe a dataset as "faithful" if we believe it accurately captures reality. Typically, untrustworthy datasets contain:
Unrealistic or incorrect values
For example, dates in the future, locations that don't exist, negative counts, or large outliers.
Violations of obvious dependencies
For example, age and birthday for individuals don't match.
Hand-entered data
As we have seen, these are typically filled with spelling errors and inconsistencies.
Clear signs of data falsification
For example, repeated names, fake-looking email addresses, or repeated use of uncommon names or fields.
Notice the many similarities to data cleaning. As we have mentioned, we often go back and forth between data cleaning and EDA, especially when determining data faithfulness. For example, visualizations often help us identify strange entries in the data.
calls = pd.read_csv('data/calls.csv')
calls.head()
calls['CASENO'].plot.hist(bins=30)
Notice the unexpected clusters at 17030000 and 17090000. By plotting the distribution of case numbers, we can quickly see anomalies in the data. In this case, we might guess that two different teams of police use different sets of case numbers for their calls.
Exploring the data often reveals anomalies; if fixable, we can then apply data cleaning techniques.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/06'))
There is a magic in graphs. The profile of a curve reveals in a flash a whole situation — the life history of an epidemic, a panic, or an era of prosperity. The curve informs the mind, awakens the imagination, convinces.
― Henry D. Hubbard
Data visualization is an essential tool for data science at every step of analysis, from data cleaning to EDA to communicating conclusions and predictions. Because human minds are highly developed for visual perception, a well-chosen plot can often reveal trends and anomalies in the data much more efficiently than a textual description.
To effectively use data visualizations, you must be proficient with both the
programming tools to generate plots and the principles of visualization. In
this chapter we will introduce seaborn and matplotlib, our tools of choice
for creating plots. We will also learn how to spot misleading visualizations
and how to improve visualizations using data transformations, smoothing, and
dimensionality reduction.
# HIDDEN
def df_interact(df):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + 5, col:col + 8]
    interact(peek, row=(0, len(df), 5), col=(0, len(df.columns) - 6))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
We generally use different types of charts to visualize quantitative (numerical) data and qualitative (ordinal or nominal) data.
For quantitative data, we most often use histograms, box plots, and scatter plots.
We can use the seaborn plotting library to create these plots in Python. We will use a dataset containing information about passengers aboard the Titanic.
# Import seaborn and apply its plotting styles
import seaborn as sns
sns.set()
# Load the dataset and drop N/A values to make plot function calls simpler
ti = sns.load_dataset('titanic').dropna().reset_index(drop=True)
# This table is too large to fit onto a page so we'll output sliders to
# pan through different sections.
df_interact(ti)
We can see that the dataset contains one row for every passenger. Each row includes the age of the passenger and the amount the passenger paid for a ticket. Let's visualize the ages using a histogram. We can use seaborn's distplot function:
# Adding a semi-colon at the end tells Jupyter not to output the
# usual <matplotlib.axes._subplots.AxesSubplot> line
sns.distplot(ti['age']);
By default, seaborn's distplot function will output a smoothed curve that roughly fits the distribution. We can also add a rugplot which marks each individual point on the x-axis:
sns.distplot(ti['age'], rug=True);
We can also plot the histogram alone, without the smoothed curve. Adjusting the number of bins shows that there were a number of children on board.
sns.distplot(ti['age'], kde=False, bins=30);
Box plots are a convenient way to see where most of the data lie. Typically, we use the 25th and 75th percentiles of the data as the start and endpoints of the box and draw a line within the box for the 50th percentile (the median). We draw two "whiskers" that extend to show the remaining data except outliers, which are marked as individual points outside the whiskers.
sns.boxplot(x='fare', data=ti);
We typically use the Inter-Quartile Range (IQR) to determine which points are considered outliers for the box plot. The IQR is the difference between the 75th percentile of the data and the 25th percentile.
lower, upper = np.percentile(ti['fare'], [25, 75])
iqr = upper - lower
iqr
Values greater than 1.5 IQR above the 75th percentile and less than 1.5 IQR below the 25th percentile are considered outliers, and we can see them marked individually on the boxplot above:
upper_cutoff = upper + 1.5 * iqr
lower_cutoff = lower - 1.5 * iqr
upper_cutoff, lower_cutoff
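The same cutoffs can be used to flag the outlying points directly. Here is a self-contained numpy sketch on made-up data (not the Titanic fares), applying the 1.5 IQR rule:

```python
import numpy as np

# Made-up data with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Compute the IQR and the whisker cutoffs, as above
lower, upper = np.percentile(data, [25, 75])
iqr = upper - lower
lower_cutoff = lower - 1.5 * iqr
upper_cutoff = upper + 1.5 * iqr

# Points outside the cutoffs are the ones a box plot draws individually
outliers = data[(data < lower_cutoff) | (data > upper_cutoff)]
# outliers is [100]
```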
Although histograms show the entire distribution at once, box plots are often easier to understand when we split the data by different categories. For example, we can make one box plot for each passenger type:
sns.boxplot(x='fare', y='who', data=ti);
The separate box plots are much easier to understand than the overlaid histogram below which plots the same data:
sns.distplot(ti.loc[ti['who'] == 'woman', 'fare'])
sns.distplot(ti.loc[ti['who'] == 'man', 'fare'])
sns.distplot(ti.loc[ti['who'] == 'child', 'fare']);
You may have noticed that the boxplot call to make separate box plots for the who column was simpler than the equivalent code to make an overlaid histogram. Although sns.distplot takes in an array or Series of data, most other seaborn functions allow you to pass in a DataFrame and specify which column to plot on the x and y axes. For example:
# Plots the `fare` column of the `ti` DF on the x-axis
sns.boxplot(x='fare', data=ti);
When the column is categorical (the 'who' column contained 'woman', 'man', and 'child'), seaborn will automatically split the data by category before plotting. This means we don't have to filter out each category ourselves like we did for sns.distplot.
# fare (numerical) on the x-axis,
# who (nominal) on the y-axis
sns.boxplot(x='fare', y='who', data=ti);
Scatter plots are used to compare two quantitative variables. We can compare the age and fare columns of our Titanic dataset using a scatter plot.
sns.lmplot(x='age', y='fare', data=ti);
By default, seaborn will also fit a regression line to our scatter plot and bootstrap the scatter plot to create a 95% confidence interval around the regression line, shown as the light blue shading around the line above. In this case, the regression line doesn't seem to fit the scatter plot very well, so we can turn off the regression.
sns.lmplot(x='age', y='fare', data=ti, fit_reg=False);
We can color the points using a categorical variable. Let's use the who column once more:
sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False);
From this plot we can see that all passengers below the age of 18 or so were marked as child. There doesn't seem to be a noticeable split between male and female passenger fares, although the two most expensive tickets were purchased by males.
For qualitative or categorical data, we most often use bar charts and dot charts. We will show how to create these plots using seaborn and the Titanic survivors dataset.
# Import seaborn and apply its plotting styles
import seaborn as sns
sns.set()
# Load the dataset
ti = sns.load_dataset('titanic').reset_index(drop=True)
# This table is too large to fit onto a page so we'll output sliders to
# pan through different sections.
df_interact(ti)
In seaborn, there are two types of bar charts. The first type uses the countplot method to count up the number of times each category appears in a column.
# Counts how many passengers survived and didn't survive and
# draws bars with corresponding heights
sns.countplot(x='alive', data=ti);
sns.countplot(x='class', data=ti);
# As with box plots, we can break down each category further using color
sns.countplot(x='alive', hue='class', data=ti);
The barplot method, on the other hand, groups the DataFrame by a categorical column and plots the height of the bars according to the average of a numerical column within each group.
# For each set of alive/not alive passengers, compute and plot the average age.
sns.barplot(x='alive', y='age', data=ti);
The height of each bar can be computed by grouping the original DataFrame and averaging the age column:
ti[['alive', 'age']].groupby('alive').mean()
By default, the barplot method will also compute a bootstrap 95% confidence interval for each averaged value, marked as the black lines in the bar chart above. The confidence intervals show that if the dataset contained a random sample of Titanic passengers, the difference between passenger age for those that survived and those that didn't is not statistically significant at the 5% significance level.
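Seaborn computes these intervals by bootstrapping: resampling the data with replacement many times and taking percentiles of the resampled means. A rough standalone sketch of the idea (our own illustration, not seaborn's exact implementation):

```python
import numpy as np

rng = np.random.RandomState(42)
ages = rng.normal(30, 10, size=200)  # stand-in for a column like ti['age']

# Bootstrap: resample with replacement, record each resample's mean
boot_means = [rng.choice(ages, size=len(ages), replace=True).mean()
              for _ in range(1000)]

# The middle 95% of bootstrap means approximates a 95% confidence
# interval for the population mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
```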
These confidence intervals take a long time to generate for larger datasets, so it is sometimes useful to turn them off:
sns.barplot(x='alive', y='age', data=ti, ci=False);
Dot charts are similar to bar charts. Instead of plotting bars, dot charts mark a single point at the end of where a bar would go. We use the pointplot method to make dot charts in seaborn. Like the barplot method, the pointplot method also automatically groups the DataFrame and computes the average of a separate numerical variable, marking 95% confidence intervals as vertical lines centered on each point.
# For each set of alive/not alive passengers, compute and plot the average age.
sns.pointplot(x='alive', y='age', data=ti);
Dot charts are most useful when comparing changes across categories:
# Shows the proportion of survivors for each passenger class
sns.pointplot(x='class', y='survived', data=ti);
# Shows the proportion of survivors for each passenger class,
# split by whether the passenger was an adult male
sns.pointplot(x='class', y='survived', hue='adult_male', data=ti);
## matplotlib

Although seaborn allows us to quickly create many types of plots, it does not give us fine-grained control over the chart. For example, we cannot use seaborn to modify a plot's title, change x or y-axis labels, or add annotations to a plot. Instead, we must use the matplotlib library that seaborn is built on.
matplotlib provides basic building blocks for creating plots in Python. Although it gives great control, it is also more verbose—recreating the seaborn plots from the previous sections in matplotlib would take many lines of code. In fact, we can think of seaborn as a set of useful shortcuts to create matplotlib plots. Although we prefer to prototype plots in seaborn, in order to customize plots for publication we will need to learn basic pieces of matplotlib.
Before we look at our first simple example, we must activate matplotlib support in the notebook:
# This line allows matplotlib plots to appear as images in the notebook
# instead of in a separate window.
%matplotlib inline
# plt is a commonly used shortcut for matplotlib
import matplotlib.pyplot as plt
In order to create a plot in matplotlib, we create a figure, then add an axes to the figure. In matplotlib, an axes is a single chart, and figures can contain multiple axes in a tabular layout. An axes contains marks, the lines or patches drawn on the plot.
# Create a figure
f = plt.figure()
# Add an axes to the figure. The second and third arguments create a table
# with 1 row and 1 column. The first argument places the axes in the first
# cell of the table.
ax = f.add_subplot(1, 1, 1)
# Create a line plot on the axes
ax.plot([0, 1, 2, 3], [1, 3, 4, 3])
# Show the plot. This will automatically get called in a Jupyter notebook
# so we'll omit it in future cells
plt.show()
To customize the plot, we can use other methods on the axes object:
f = plt.figure()
ax = f.add_subplot(1, 1, 1)
x = np.arange(0, 10, 0.1)
# Setting the label kwarg lets us generate a legend
ax.plot(x, np.sin(x), label='sin(x)')
ax.plot(x, np.cos(x), label='cos(x)')
ax.legend()
ax.set_title('Sinusoids')
ax.set_xlabel('x')
ax.set_ylabel('y');
As a shortcut, matplotlib has plotting methods on the plt module itself that will automatically initialize a figure and axes.
# Shorthand to create figure and axes and call ax.plot
plt.plot(x, np.sin(x))
# When plt methods are called multiple times in the same cell, the
# existing figure and axes are reused.
plt.scatter(x, np.cos(x));
The plt module has analogous methods to an axes, so we can recreate one of the plots above using plt shorthands.
x = np.arange(0, 10, 0.1)
plt.plot(x, np.sin(x), label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')
plt.legend()
# Shorthand for ax.set_title
plt.title('Sinusoids')
plt.xlabel('x')
plt.ylabel('y')
# Set the x and y-axis limits
plt.xlim(-1, 11)
plt.ylim(-1.2, 1.2);
To change properties of the plot marks themselves (e.g. the lines in the plot above), we can pass additional arguments into plt.plot.
plt.plot(x, np.sin(x), linestyle='--', color='purple');
Checking the matplotlib documentation is the easiest way to figure out which arguments are available for each method. Another way is to store the returned line object:
In [1]: line, = plt.plot([1, 2, 3])
These line objects have many properties you can control. Here's the full list, using tab-completion in IPython:
In [2]: line.set
line.set line.set_drawstyle line.set_mec
line.set_aa line.set_figure line.set_mew
line.set_agg_filter line.set_fillstyle line.set_mfc
line.set_alpha line.set_gid line.set_mfcalt
line.set_animated line.set_label line.set_ms
line.set_antialiased line.set_linestyle line.set_picker
line.set_axes line.set_linewidth line.set_pickradius
line.set_c line.set_lod line.set_rasterized
line.set_clip_box line.set_ls line.set_snap
line.set_clip_on line.set_lw line.set_solid_capstyle
line.set_clip_path line.set_marker line.set_solid_joinstyle
line.set_color line.set_markeredgecolor line.set_transform
line.set_contains line.set_markeredgewidth line.set_url
line.set_dash_capstyle line.set_markerfacecolor line.set_visible
line.set_dashes line.set_markerfacecoloralt line.set_xdata
line.set_dash_joinstyle line.set_markersize line.set_ydata
line.set_data line.set_markevery line.set_zorder
The plt.setp call (short for "set property") can be very useful, especially while working interactively, because it supports introspection, so you can learn about the valid calls as you work:
In [7]: line, = plt.plot([1, 2, 3])
In [8]: plt.setp(line, 'linestyle')
linestyle: [ ``'-'`` | ``'--'`` | ``'-.'`` | ``':'`` | ``'None'`` | ``' '`` | ``''`` ] and any drawstyle in combination with a linestyle, e.g. ``'steps--'``.
In [9]: plt.setp(line)
agg_filter: unknown
alpha: float (0.0 transparent through 1.0 opaque)
animated: [True | False]
antialiased or aa: [True | False]
...
... much more output omitted
...
In the first form, it shows you the valid values for the 'linestyle' property, and in the second it shows you all the acceptable properties you can set on the line object. This makes it easy to discover how to customize your figures to get the visual results you need.
In matplotlib, text can be added either relative to an individual axis object or to the whole figure.
These commands add text to the Axes:
- set_title() - add a title
- set_xlabel() - add an axis label to the x-axis
- set_ylabel() - add an axis label to the y-axis
- text() - add text at an arbitrary location
- annotate() - add an annotation, with optional arrow

And these act on the whole figure:

- figtext() - add text at an arbitrary location
- suptitle() - add a title

Any text field can contain LaTeX expressions for mathematics, as long as they are enclosed in $ signs.
This example illustrates all of them:
fig = plt.figure()
fig.suptitle('bold figure suptitle', fontsize=14, fontweight='bold')
ax = fig.add_subplot(1, 1, 1)
fig.subplots_adjust(top=0.85)
ax.set_title('axes title')
ax.set_xlabel('xlabel')
ax.set_ylabel('ylabel')
ax.text(3, 8, 'boxed italics text in data coords', style='italic',
        bbox={'facecolor': 'red', 'alpha': 0.5, 'pad': 10})
ax.text(2, 6, 'an equation: $E=mc^2$', fontsize=15)
ax.text(3, 2, 'unicode: Institut für Festkörperphysik')
ax.text(0.95, 0.01, 'colored text in axes coords',
        verticalalignment='bottom', horizontalalignment='right',
        transform=ax.transAxes,
        color='green', fontsize=15)
ax.plot([2], [1], 'o')
ax.annotate('annotate', xy=(2, 1), xytext=(3, 4),
            arrowprops=dict(facecolor='black', shrink=0.05))
ax.axis([0, 10, 0, 10]);
## seaborn plot using matplotlib

Now that we've seen how to use matplotlib to customize a plot, we can use the same methods to customize seaborn plots, since seaborn creates its plots using matplotlib behind the scenes.
# Load seaborn
import seaborn as sns
sns.set()
sns.set_context('talk')
# Load dataset
ti = sns.load_dataset('titanic').dropna().reset_index(drop=True)
ti.head()
We'll start with this plot:
sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False);
We can see that the plot needs a title and better labels for the x and y-axes. In addition, the two people with the most expensive fares survived, so we can annotate them on our plot.
sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False)
plt.title('Fare Paid vs. Age of Passenger, Colored by Passenger Type')
plt.xlabel('Age of Passenger')
plt.ylabel('Fare in USD')
plt.annotate('Both survived', xy=(35, 500), xytext=(35, 420),
             arrowprops=dict(facecolor='black', shrink=0.05));
In practice, we use seaborn to quickly explore the data and then turn to matplotlib for fine-tuning once we decide on the plots to use in a paper or presentation.
Now that we have the tools to create and alter plots, we turn to key principles for data visualization. Much like other parts of data science, it is difficult to precisely assign a number that measures how effective a specific visualization is. Still, there are general principles that make visualizations much more effective at showing trends in the data. We discuss six categories of principles: scale, conditioning, perception, transformation, context, and smoothing.
Principles of scale relate to the choice of x and y-axis used to plot the data.
In a 2015 US Congressional hearing, representative Chaffetz discussed an investigation of Planned Parenthood programs. He presented the following plot that originally appeared in a report by Americans United for Life. It compares the number of abortion and cancer screening procedures, both of which are offered by Planned Parenthood. (The full report is available at https://oversight.house.gov/interactivepage/plannedparenthood.)
What is suspicious about this plot? How many data points are plotted?

This plot violates principles of scale; it doesn't make good choices for its x and y-axis.
When we select the x and y-axis for our plot, we should keep a consistent scale across the entire axis. However, the plot above has different scales for the Abortion and Cancer Screening lines—the start of the Abortion line and end of the Cancer Screening line lie close to each other on the y-axis but represent vastly different numbers. In addition, only points from 2006 and 2013 are plotted but the x-axis contains unnecessary tick marks for every year in between.
To improve this plot, we should re-plot the points on the same y-axis scale:
# HIDDEN
pp = pd.read_csv("data/plannedparenthood.csv")
plt.plot(pp['year'], pp['screening'], linestyle="solid", marker="o", label='Cancer')
plt.plot(pp['year'], pp['abortion'], linestyle="solid", marker="o", label='Abortion')
plt.title('Planned Parenthood Procedures')
plt.xlabel("Year")
plt.ylabel("Service")
plt.xticks([2006, 2013])
plt.legend();
We can see that the change in number of Abortions is very small compared to the large drop in the number of Cancer Screenings. Instead of the number of procedures, we might instead be interested in the percent change in number.
# HIDDEN
percent_change = pd.DataFrame({
    'percent_change': [
        pp['screening'].iloc[1] / pp['screening'].iloc[0] - 1,
        pp['abortion'].iloc[1] / pp['abortion'].iloc[0] - 1,
    ],
    'procedure': ['cancer', 'abortion'],
    'type': ['percent_change', 'percent_change'],
})
ax = sns.barplot(x='procedure', y='percent_change', data=percent_change)
plt.title('Percent Change in Number of Procedures')
plt.xlabel('')
plt.ylabel('Percent Change')
plt.ylim(-0.6, 0.6)
plt.axhline(y=0, c='black');
When selecting the x and y-axis limits we prefer to focus on the region with the bulk of the data, especially when working with long-tailed data. Consider the following plot and its zoomed in version to its right:

The plot on the right is much more helpful for making sense of the dataset. If needed, we can make multiple plots of different regions of the data to show the entire range of data. Later in this section, we discuss data transformations which also help visualize long-tailed data.
Principles of conditioning give us techniques to show distributions and relationships between subgroups of our data.
The US Bureau of Labor Statistics oversees scientific surveys related to the economic health of the US. Their website contains a tool to generate reports using this data that was used to generate this chart comparing median weekly earnings split by sex.
Which comparisons are easiest to make using this plot? Are these the comparisons that are most interesting or important?

This plot lets us see at a glance that weekly earnings tend to increase with more education. However, it is difficult to tell exactly how much each level of education increases earnings and it is even more difficult to compare male and female weekly earnings at the same education level. We can uncover both these trends by using a dot chart instead of a bar chart.
# HIDDEN
cps = pd.read_csv("data/edInc2.csv")
ax = sns.pointplot(x="educ", y="income", hue="gender", data=cps)
ticks = ["<HS", "HS", "<BA", "BA", ">BA"]
ax.set_xticklabels(ticks)
ax.set_xlabel("Education")
ax.set_ylabel("Income")
ax.set_title("2014 Median Weekly Earnings\nFull-Time Workers over 25 years old");
The lines connecting the points more clearly show the relatively large effect of having a BA degree on weekly earnings. Placing the points for males and females directly above each other makes it much easier to see that the wage gap between males and females tends to increase with higher education levels.
To aid comparison of two subgroups within your data, align markers along the x or y-axis and use different colors or markers for different subgroups. Lines tend to show trends in data more clearly than bars and are a useful choice for both ordinal and numerical data.
Human perception has specific properties that are important to consider in visualization design. The first important property of human perception is that we perceive some colors more strongly than others, especially green colors. In addition, we perceive lighter shaded areas as larger than darker shaded ones. For example, in the weekly earnings plot that we just discussed, the lighter bars seem to draw more attention than the darker colored ones:

Practically speaking, you should ensure that your charts' color palettes are perceptually uniform. This means that, for example, the perceived intensity of the color won't change between bars in a bar chart. For quantitative data, you have two choices: if your data progress from low to high and you want to emphasize large values, use a sequential color scheme which assigns lighter colors to large values. If both low and high values should be emphasized, use a diverging color scheme which assigns lighter colors to values closer to the center.
seaborn comes with many useful color palettes built-in. You can browse its documentation to learn how to switch between color palettes: http://seaborn.pydata.org/tutorial/color_palettes.html
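For instance, seaborn can build both kinds of palette from standard matplotlib colormap names (a small illustration; "Blues" and "coolwarm" are standard matplotlib colormaps):

```python
import seaborn as sns

# A sequential palette: light-to-dark shades of a single hue,
# appropriate for data that progress from low to high
sequential = sns.color_palette("Blues", 5)

# A diverging palette: two hues meeting at a light midpoint,
# appropriate when both low and high values matter
diverging = sns.color_palette("coolwarm", 5)

# Each palette is a list of (r, g, b) tuples usable in any seaborn plot,
# e.g. sns.barplot(..., palette=sequential)
```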
A second important property of human perception is that we are generally more accurate when we compare lengths and less accurate when we compare areas. Consider the following chart of the GDP of African countries.

By numerical value, South Africa has twice the GDP of Algeria but it's not easy to tell from the plot above. Instead, we can plot the GDPs on a dot plot:

This is much more clear because it allows us to compare lengths instead of areas. Pie charts and three-dimensional charts are difficult to interpret for the same reason; we tend to avoid these charts in practice.
Our third and final property of perception is that the human eye has difficulty with changing baselines. Consider the following stacked area chart that plots carbon dioxide emissions over time split by country.

It is difficult to see whether the UK's emissions have increased or decreased over time because of the jiggling baseline problem: the baseline (bottom line) of the area jiggles up and down. It is also difficult to compare whether the UK's emissions are greater than China's emissions when the two heights are similar (in year 2000, for example).
Similar issues of jiggling baselines appear in stacked bar charts. In the plot below, it is difficult to compare the number of 15-64 year olds between Germany and Mexico.

We can often improve a stacked area or bar chart by switching to a line chart. Here's the data of emissions over time plotted as lines instead of areas:
# HIDDEN
co2 = pd.read_csv("data/CAITcountryCO2.csv", skiprows=2,
                  names=["Country", "Year", "CO2"])
last_year = co2.Year.iloc[-1]
q = f"Country != 'World' and Country != 'European Union (15)' and Year == {last_year}"
top14_lasty = co2.query(q).sort_values('CO2', ascending=False).iloc[:14]
top14 = co2[co2.Country.isin(top14_lasty.Country) & (co2.Year >= 1950)]
from cycler import cycler
linestyles = (['-', '--', ':', '-.'] * 3)[:7]
colors = sns.color_palette('colorblind')[:4]
lines_c = cycler('linestyle', linestyles)
color_c = cycler('color', colors)
fig, ax = plt.subplots(figsize=(9, 9))
ax.set_prop_cycle(lines_c * color_c)
x, y = 'Year', 'CO2'
for name, df in top14.groupby('Country'):
    ax.semilogy(df[x], df[y], label=name)
ax.set_xlabel(x)
ax.set_ylabel(y + " Emissions [Million Tons]")
ax.legend(ncol=2, frameon=True);
This plot does not jiggle the baseline so it is much easier to compare emissions between countries. We can also more clearly see which countries increased emissions the most.
In this section, we discuss principles of visualization for transformation, context, and smoothing.
The principles of data transformation give us useful ways to alter data for visualization in order to more effectively reveal trends. We most commonly apply data transformations to reveal patterns in skewed data and non-linear relationships between variables.
The plot below shows the distribution of ticket fares for each passenger aboard the Titanic. As you can see, the distribution is skewed right.
# HIDDEN
ti = sns.load_dataset('titanic')
sns.distplot(ti['fare'])
plt.title('Fares for Titanic Passengers')
plt.xlabel('Fare in USD')
plt.ylabel('Density');
Although this histogram shows all the fares, it is difficult to see detailed patterns in the data since the fares are clumped on the left side of the histogram. To remedy this, we can take the natural log of the fares before plotting them:
# HIDDEN
sns.distplot(np.log(ti.loc[ti['fare'] > 0, 'fare']), bins=25)
plt.title('log(Fares) for Titanic Passengers')
plt.xlabel('log(Fare) in USD')
plt.ylabel('Density');
We can see from the plot of the log data that the distribution of fares has a large mode and a smaller second mode at a higher fare. Why does plotting the natural log of the data help with skew? The logarithms of large numbers tend to be close to the logarithms of small numbers:
| value | log(value) |
|---|---|
| 1 | 0.00 |
| 10 | 2.30 |
| 50 | 3.91 |
| 100 | 4.61 |
| 500 | 6.21 |
| 1000 | 6.91 |
This means that taking the logarithm of right-tailed data will bring large values close to small values. This helps us see patterns where the majority of the data lie.
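As a quick check of this effect, the sketch below compares the skewness of a sample before and after logging. It uses a synthetic log-normal sample as a stand-in for right-skewed data like the Titanic fares:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# A right-skewed sample, standing in for data like the fares
fares = rng.lognormal(mean=3, sigma=1, size=10_000)

raw_skew = skew(fares)          # strongly right-skewed
log_skew = skew(np.log(fares))  # roughly symmetric after logging
```

After the transformation, the skewness is close to zero: the logged sample is approximately symmetric.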
In fact, the logarithm is considered the Swiss army knife of data transformation—it also helps us see the nature of non-linear relationships between variables in the data. In 1619, Kepler recorded the following set of data to discover his Third Law of Planetary Motion:
planets = pd.read_csv("data/planets.data", delim_whitespace=True,
comment="#", usecols=[0, 1, 2])
planets
If we plot the mean distance to the sun against the period of the orbit, we can see a relationship that doesn't quite look linear:
sns.lmplot(x='mean_dist', y='period', data=planets, ci=False)
However, if we take the natural log of both mean distance and period, we obtain the following plot:
sns.lmplot(x='mean_dist', y='period',
data=np.log(planets.iloc[:, [1, 2]]),
ci=False);
We see a near-perfect linear relationship between the logged values of mean distance and period. What does this mean? Since we believe there's a linear relationship between the logged values, we can derive:

$$ \log(\text{period}) = m \cdot \log(\text{dist}) + b $$

$$ \text{period} = e^{m \cdot \log(\text{dist}) + b} = e^b \cdot \text{dist}^m = C \cdot \text{dist}^m $$

We replaced $e^b$ with $C$ in the last step to represent $e^b$ as a constant. The algebraic manipulation above shows that when two variables have a polynomial relationship, the logs of the two variables have a linear relationship. In fact, we can find the degree of the polynomial by examining the slope of the line. In this case, the slope is 1.5, which gives us Kepler's third law: $\text{period} = C \cdot \text{dist}^{1.5}$ (equivalently, $\text{period}^2 \propto \text{dist}^3$).
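We can check the slope numerically by fitting a line to the logged values. The sketch below hard-codes approximate values of Kepler's data, in case data/planets.data is not at hand:

```python
import numpy as np

# Kepler's data (approximate): mean distance to the sun in AU
# and orbital period in days, Mercury through Saturn
dist = np.array([0.389, 0.724, 1.0, 1.524, 5.2, 9.51])
period = np.array([87.77, 224.70, 365.25, 686.95, 4332.62, 10759.2])

# Fit a line to the logged values; the slope recovers the degree
# of the polynomial relationship
slope, intercept = np.polyfit(np.log(dist), np.log(period), 1)
print(slope)  # close to 1.5
```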
By a similar derivation, we can also show that if the relationship between $\log(y)$ and $x$ is linear, then the two variables have an exponential relationship: $y = C e^{mx}$.
Thus, we can use the logarithm to reveal patterns in right-tailed data and common non-linear relationships between variables.
Other common data transformations include the Box-Cox transformation and polynomial transforms.
It is important to add as much relevant context as possible to any plot you plan to share more broadly. For example, the following plot shows its data clearly but provides little context to help understand what is being plotted.

To provide context, we add a title, caption, axes labels, units for the axes, and labels for the plotted lines.

(This blog post explains how to make these modifications using matplotlib.)
In general, we provide context for a plot through:
Smoothing allows us to more clearly visualize data when we have many data points. We've actually already seen an instance of smoothing: histograms are a type of smoothing for rugplots. The rugplot below shows the age of each passenger aboard the Titanic.
ages = ti['age'].dropna()
sns.rugplot(ages, height=0.2)
There are many marks that make it difficult to tell where the data lie. In addition, some of the points overlap, making it impossible to see how many points lie at 0. This issue is called overplotting and we generally avoid it whenever possible.
To reveal the distribution of the data, we can replace groups of marks with a bar that is taller when more points are in the group. Smoothing refers to this process of replacing sets of points with appropriate markers; we choose not to show every single point in the dataset in order to reveal broader trends.
sns.distplot(ages, kde=False)
We've also seen that seaborn will plot a smooth curve over a histogram by default.
sns.distplot(ages)
This is another form of smoothing called kernel density estimation (KDE). Instead of grouping points together and plotting bars, KDE places a curve on each point and combines the individual curves to create a final estimation of the distribution. Consider the rugplot below that shows three points.
# HIDDEN
points = np.array([2, 3, 5])
sns.rugplot(points, height=0.2)
plt.xlim(0, 7);
To perform KDE, we place a Gaussian (normal) distribution on each point:
# HIDDEN
from scipy.stats import norm
def gaussians(points, scale=True, sd=0.5):
    x_vals = [np.linspace(point - 2, point + 2, 100) for point in points]
    y_vals = [norm.pdf(xs, loc=point, scale=sd) for xs, point in zip(x_vals, points)]
    if scale:
        y_vals = [ys / len(points) for ys in y_vals]
    return zip(x_vals, y_vals)

for xs, ys in gaussians(points, scale=False):
    plt.plot(xs, ys, c=sns.color_palette()[0])
sns.rugplot(points, height=0.2)
plt.xlim(0, 7)
plt.ylim(0, 1);
The area under each Gaussian curve is equal to 1. Since we will sum multiple curves together, we scale each curve so that when added together the area under all the curves is equal to 1.
# HIDDEN
for xs, ys in gaussians(points):
    plt.plot(xs, ys, c=sns.color_palette()[0])
sns.rugplot(points, height=0.2)
plt.xlim(0, 7)
plt.ylim(0, 1);
Finally, we add the curves together to create a final smooth estimate for the distribution:
# HIDDEN
sns.rugplot(points, height=0.2)
sns.kdeplot(points, bw=0.5)
plt.xlim(0, 7)
plt.ylim(0, 1);
By following this procedure, we can use KDE to smooth many points.
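The procedure above can be sketched directly: sum one scaled Gaussian per point and check that the total area under the resulting curve is still (approximately) 1. The grid bounds below are illustrative choices matching the plots in this section:

```python
import numpy as np
from scipy.stats import norm

points = np.array([2, 3, 5])
sd = 0.5  # the bandwidth of the kernel

xs = np.linspace(0, 7, 701)
# Sum one scaled Gaussian per point; dividing by len(points)
# keeps the total area under the combined curve equal to 1
density = sum(norm.pdf(xs, loc=p, scale=sd) for p in points) / len(points)

# Approximate the area under the curve with a Riemann sum
area = density.sum() * (xs[1] - xs[0])
```

The computed area is approximately 1, confirming that the scaled sum is itself a valid density estimate.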
# Show the original unsmoothed points
sns.rugplot(ages, height=0.1)
# Show the smooth estimation of the distribution
sns.kdeplot(ages);
In the previous examples of KDE, we placed a miniature Gaussian curve on each point and added the Gaussians together.
# HIDDEN
for xs, ys in gaussians(points):
    plt.plot(xs, ys, c=sns.color_palette()[0])
sns.rugplot(points, height=0.2)
plt.xlim(0, 7)
plt.ylim(0, 1);
We are free to adjust the width of the Gaussians. For example, we can make each Gaussian narrower. This is called decreasing the bandwidth of the kernel estimation.
# HIDDEN
for xs, ys in gaussians(points, sd=0.3):
    plt.plot(xs, ys, c=sns.color_palette()[0])
sns.rugplot(points, height=0.2)
plt.xlim(0, 7)
plt.ylim(0, 1);
When we add these narrower Gaussians together, we create a more detailed final estimation.
# HIDDEN
sns.rugplot(points, height=0.2)
sns.kdeplot(points, bw=0.2)
plt.xlim(0, 7)
plt.ylim(0, 1);
# Plot the KDE for Titanic passenger ages using a lower bandwidth
sns.rugplot(ages, height=0.1)
sns.kdeplot(ages, bw=0.5);
Just like adjusting bins for a histogram, we typically adjust the bandwidth until we believe the final plot shows the distribution without distracting the viewer with too much detail.
Although we have placed a Gaussian at each point so far, we can easily select other functions to estimate each point. This is called changing the kernel of the kernel density estimation. Previously, we used a Gaussian kernel. Now, we'll use a triangular kernel, which places a triangle (a pair of sloped line segments) at each point:
# HIDDEN
sns.rugplot(points, height=0.2)
sns.kdeplot(points, kernel='tri', bw=0.3)
# Plot the KDE for Titanic passenger ages using a triangular kernel
sns.rugplot(ages, height=0.1)
sns.kdeplot(ages, kernel='tri');
Usually we'll use a Gaussian kernel unless we have a specific reason to use a different kernel.
We can also smooth two-dimensional plots when we encounter the problem of overplotting.
The following example comes from a dataset released by the Cherry Blossom Run, an annual 10-mile run in Washington D.C. Each runner can report their age and their race time; we've plotted all the reported data points in the scatter plot below.
runners = pd.read_csv('data/cherryBlossomMen.csv').dropna()
runners
sns.lmplot(x='age', y='time', data=runners, fit_reg=False);
So many points lie on top of each other that it's difficult to see any trend at all!
We can smooth the scatter plot using kernel density estimation in two dimensions. When KDE is applied to a two-dimensional plot, we place a three-dimensional Gaussian at each point. In three dimensions, the Gaussian looks like a mountain pointing out of the page.
# Plot three points
two_d_points = pd.DataFrame({'x': [1, 3, 4], 'y': [4, 3, 1]})
sns.lmplot(x='x', y='y', data=two_d_points, fit_reg=False)
plt.xlim(-2, 7)
plt.ylim(-2, 7);
# Place a Gaussian at each point and use a contour plot to show each one
sns.kdeplot(two_d_points['x'], two_d_points['y'], bw=0.4)
plt.xlim(-2, 7)
plt.ylim(-2, 7);
Just like we've previously seen, we scale each Gaussian and add them together to obtain a final contour plot for the scatter plot.
# HIDDEN
sns.kdeplot(two_d_points['x'], two_d_points['y'])
plt.xlim(-2, 7)
plt.ylim(-2, 7);
The resulting plot shows the downward sloping trend of the three points. Similarly, we can apply a KDE to smooth out the scatter plot of runner ages and times:
sns.kdeplot(runners['age'], runners['time'])
plt.xlim(-10, 70)
plt.ylim(3000, 8000);
We can see that most of our runners were between 25 and 50 years old, and that most runners took between 4000 and 7000 seconds (roughly between 1 and 2 hours) to finish the race.
We can see more clearly that there is a suspicious group of runners that are between zero and ten years old. We might want to double check that our data for those ages was recorded properly.
We can also see a slight upward trend in the time taken to finish the race as runner age increases.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/06'))
Unfortunately, many charts in the world at large do not adhere to our principles for data visualization. Here is a page of charts that appear on the first page of a Google search for "good data visualization" conducted in Spring of 2018. How many poorly made charts can you spot?

Effective data visualization reveals the data. All of our principles for visualization aim to make the data more understandable for the viewer. Data visualization is sometimes used to mislead and misinform. When done properly, however, data visualization is one of our greatest tools for discovering, revealing, and communicating trends and anomalies in our data.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/07'))
Before the Internet, data scientists had to physically move hard disk drives to share data with others. Now, we can freely retrieve datasets from computers across the world.
Although we use the Internet to download and share data files, the web pages on the Internet themselves contain huge amounts of information as text, images, and videos. By learning web technologies, we can use the Web as a data source. In this chapter, we introduce HTTP, the primary communication protocol for the Web, and XML/HTML, the primary document formats for web pages.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/07'))
HTTP (AKA HyperText Transfer Protocol) is a request-response protocol that allows one computer to talk to another over the Internet.
The Internet allows computers to send text to one another, but does not impose any restrictions on what that text contains. HTTP defines a structure on the text communication between one computer (client) and another (server). In this protocol, a client submits a request, a specially formatted text message, to a server. The server then sends a text response back to the client.
The command line tool curl gives us a simple way to send HTTP requests. In the output below, lines starting with > indicate the text sent in our request; the remaining lines are the server's response.
$ curl -v https://httpbin.org/html
> GET /html HTTP/1.1
> Host: httpbin.org
> User-Agent: curl/7.55.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Connection: keep-alive
< Server: meinheld/0.6.1
< Date: Wed, 11 Apr 2018 18:15:03 GMT
<
<html>
<body>
<h1>Herman Melville - Moby-Dick</h1>
<p>
Availing himself of the mild...
</p>
</body>
</html>
Running the curl command above causes the client's computer to construct a text message that looks like:
GET /html HTTP/1.1
Host: httpbin.org
User-Agent: curl/7.55.1
Accept: */*
{blank_line}
This message follows a specific format: it starts with GET /html HTTP/1.1 which indicates that the message is an HTTP GET request to the /html page. Each of the three lines that follow form HTTP headers, optional information that curl sends to the server. The HTTP headers have the format {name}: {value}. Finally, the blank line at the end of the message tells the server that the message ends after three headers. Note that we've marked the blank line with {blank_line} in the snippet above; in the actual message {blank_line} is replaced with a blank line.
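To make the structure concrete, we can assemble the same request text by hand. The helper build_get_request below is our own illustration, not part of any library; note that HTTP separates lines with \r\n, and the trailing blank line becomes the \r\n\r\n terminator:

```python
def build_get_request(host, path, headers):
    """Assemble the text of an HTTP/1.1 GET request."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # The blank line at the end tells the server the headers are finished
    return "\r\n".join(lines) + "\r\n\r\n"

request_text = build_get_request(
    "httpbin.org", "/html",
    {"User-Agent": "curl/7.55.1", "Accept": "*/*"},
)
print(request_text)
```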
The client's computer then uses the Internet to send this message to the https://httpbin.org web server. The server processes the request, and sends the following response:
HTTP/1.1 200 OK
Connection: keep-alive
Server: meinheld/0.6.1
Date: Wed, 11 Apr 2018 18:15:03 GMT
{blank_line}
The first line of the response states that the request completed successfully. The following three lines form the HTTP response headers, optional information that the server sends back to the client. Finally, the blank line at the end of the message tells the client that the server has finished sending its response headers and will next send the response body:
<html>
<body>
<h1>Herman Melville - Moby-Dick</h1>
<p>
Availing himself of the mild...
</p>
</body>
</html>
This HTTP protocol is used in almost every application that interacts with the Internet. For example, visiting https://httpbin.org/html in your web browser makes the same basic HTTP request as the curl command above. Instead of displaying the response as plain text as we have above, your browser recognizes that the text is an HTML document and will display it accordingly.
In practice, we will not write out full HTTP requests in text. Instead, we use tools like curl or Python libraries to construct requests for us.
The Python requests library allows us to make HTTP requests in Python. The code below makes the same HTTP request as running curl -v https://httpbin.org/html.
import requests
url = "https://httpbin.org/html"
response = requests.get(url)
response
Let's take a closer look at the request we made. We can access the original request using the response object; we display the request's HTTP headers below:
request = response.request
for key in request.headers:  # The request's headers are stored as a dictionary.
    print(f'{key}: {request.headers[key]}')
Every HTTP request has a type. In this case, we used a GET request which retrieves information from a server.
request.method
Let's examine the response we received from the server. First, we will print the response's HTTP headers.
for key in response.headers:
    print(f'{key}: {response.headers[key]}')
An HTTP response contains a status code, a special number that indicates whether the request succeeded or failed. The status code 200 indicates that the request succeeded.
response.status_code
Finally, we display the first 100 characters of the response's content (the entire response content is too long to display nicely here).
response.text[:100]
The request we made above was a GET HTTP request. There are multiple HTTP request types; the most important two are GET and POST requests.
The GET request is used to retrieve information from the server. Since your web browser makes a GET request whenever you enter a URL into its address bar, GET requests are the most common type of HTTP request.
curl uses GET requests by default, so running curl https://www.google.com/ makes a GET request to https://www.google.com/.
The POST request is used to send information from the client to the server. For example, some web pages contain forms for the user to fill out—a login form, for example. After clicking the "Submit" button, most web browsers will make a POST request to send the form data to the server for processing.
Let's look at an example of a POST request that sends 'sam' as the parameter 'name'. We can make this request by running curl -d 'name=sam' https://httpbin.org/post on the command line.
Notice that our request has a body this time (filled with the parameters of the POST request), and the content of the response is different from our GET response from before.
Like HTTP headers, the data sent in a POST request uses a key-value format. In Python, we can make a POST request by using requests.post and passing in a dictionary as an argument.
post_response = requests.post("https://httpbin.org/post",
data={'name': 'sam'})
post_response
The server will respond with a status code to indicate whether the POST request successfully completed. In addition, the server will usually send a response body to display to the client.
post_response.status_code
post_response.text
The previous HTTP responses had the HTTP status code 200. This status code indicates that the request completed successfully. There are hundreds of other HTTP status codes. Thankfully, they are grouped into categories to make them easier to remember:

| Code range | Category | Meaning |
|---|---|---|
| 100s | Informational | More input is expected from the client or server |
| 200s | Success | The client's request was successful |
| 300s | Redirection | The requested resource is located elsewhere |
| 400s | Client Error | The client made an error in the request (e.g. 404 Not Found) |
| 500s | Server Error | The server failed to fulfill the request due to an error on its end (e.g. 500 Internal Server Error) |
We can look at examples of some of these errors.
# This page doesn't exist, so we get a 404 page not found error
url = "https://www.youtube.com/404errorwow"
errorResponse = requests.get(url)
print(errorResponse)
# This specific page results in a 500 server error
url = "https://httpstat.us/500"
serverResponse = requests.get(url)
print(serverResponse)
We have introduced the HTTP protocol, the basic communication method for applications that use the Web. Although the protocol specifies a specific text format, we typically turn to other tools to make HTTP requests for us, such as the command line tool curl and the Python library requests.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
A great quantity of data resides not as numbers in CSVs but as free-form text in books, documents, blog posts, and Internet comments. While numerical and categorical data are often collected from physical phenomena, textual data arises from human communication and expression. As with most types of data, there are a multitude of techniques for working with text that would take multiple books to explain in full detail. In this chapter, we introduce a small subset of these techniques that provide a variety of useful operations for working with text: Python string manipulation and regular expressions.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
Python provides a variety of methods for basic string manipulation. Although simple, these methods are the primitives that we piece together to form more complex string operations. We will introduce Python's string methods in the context of a common use case for working with text: data cleaning.
Data often comes from several different sources, each of which implements its own way of encoding information. In the following example, we have one table that records the state that a county belongs to and another that records the county's population.
# HIDDEN
state = pd.DataFrame({
'County': [
'De Witt County',
'Lac qui Parle County',
'Lewis and Clark County',
'St John the Baptist Parish',
],
'State': [
'IL',
'MN',
'MT',
'LA',
]
})
population = pd.DataFrame({
'County': [
'DeWitt ',
'Lac Qui Parle',
'Lewis & Clark',
'St. John the Baptist',
],
'Population': [
'16,798',
'8,067',
'55,716',
'43,044',
]
})
state
population
We would naturally like to join the state and population tables using the County column. Unfortunately, not a single county is spelled the same in the two tables. This example is illustrative of the following common issues in text data:
- Capitalization: qui vs Qui
- Omission of punctuation: St. vs St
- Omission of words: County/Parish is absent in the population table
- Use of whitespace: DeWitt vs De Witt
- Different abbreviations: & vs and

Python's string methods allow us to start resolving these issues. These methods are conveniently defined on all Python strings and thus do not require importing other modules. Although it is worth familiarizing yourself with the complete list of string methods, we describe a few of the most commonly used methods in the table below.
| Method | Description |
|---|---|
| str[x:y] | Slices str, returning indices x (inclusive) to y (not inclusive) |
| str.lower() | Returns a copy of the string with all letters converted to lowercase |
| str.replace(a, b) | Replaces all instances of the substring a in str with the substring b |
| str.split(a) | Returns substrings of str split at a substring a |
| str.strip() | Removes leading and trailing whitespace from str |
We select the string for St. John the Baptist parish from the state and population tables and apply string methods to remove capitalization, punctuation, and county/parish occurrences.
john1 = state.loc[3, 'County']
john2 = population.loc[3, 'County']
(john1
.lower()
.strip()
.replace(' parish', '')
.replace(' county', '')
.replace('&', 'and')
.replace('.', '')
.replace(' ', '')
)
Applying the same set of methods to john2 allows us to verify that the two strings are now identical.
(john2
.lower()
.strip()
.replace(' parish', '')
.replace(' county', '')
.replace('&', 'and')
.replace('.', '')
.replace(' ', '')
)
Satisfied, we create a method called clean_county that normalizes an input county.
def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))
We may now verify that the clean_county method produces matching counties for all the counties in both tables:
([clean_county(county) for county in state['County']],
[clean_county(county) for county in population['County']]
)
Because each county in both tables has the same transformed representation, we may successfully join the two tables using the transformed county names.
In the code above we used a loop to transform each county name. pandas Series objects provide a convenient way to apply string methods to each item in the series. First, the series of county names in the state table:
state['County']
The .str property on pandas Series exposes the same string methods as Python does. Calling a method on the .str property calls the method on each item in the series.
state['County'].str.lower()
This allows us to transform each string in the series without using a loop.
(state['County']
.str.lower()
.str.strip()
.str.replace(' parish', '')
.str.replace(' county', '')
.str.replace('&', 'and')
.str.replace('.', '')
.str.replace(' ', '')
)
We save the transformed counties back into their originating tables:
state['County'] = (state['County']
.str.lower()
.str.strip()
.str.replace(' parish', '')
.str.replace(' county', '')
.str.replace('&', 'and')
.str.replace('.', '')
.str.replace(' ', '')
)
population['County'] = (population['County']
.str.lower()
.str.strip()
.str.replace(' parish', '')
.str.replace(' county', '')
.str.replace('&', 'and')
.str.replace('.', '')
.str.replace(' ', '')
)
Now, the two tables contain the same string representation of the counties:
state
population
It is simple to join these tables once the counties match.
state.merge(population, on='County')
Python's string methods form a set of simple and useful operations for string manipulation. pandas Series implement the same methods that apply the underlying Python method to each string in the series.
You may find the complete docs on Python's string methods in the Python documentation and the docs on the pandas str methods in the pandas documentation.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
In this section we introduce regular expressions, an important tool to specify patterns in strings.
In a larger piece of text, many useful substrings come in a specific format. For instance, the sentence below contains a U.S. phone number.
"give me a call, my number is 123-456-7890."
The phone number contains the following pattern:
Given a free-form segment of text, we might naturally wish to detect and extract the phone numbers. We may also wish to extract specific pieces of the phone numbers—for example, by extracting the area code we may deduce the locations of individuals mentioned in the text.
To detect whether a string contains a phone number, we may attempt to write a method like the following:
def is_phone_number(string):
    digits = '0123456789'
    def is_not_digit(token):
        return token not in digits

    # Three numbers
    for i in range(3):
        if is_not_digit(string[i]):
            return False
    # Followed by a dash
    if string[3] != '-':
        return False
    # Followed by three numbers
    for i in range(4, 7):
        if is_not_digit(string[i]):
            return False
    # Followed by a dash
    if string[7] != '-':
        return False
    # Followed by four numbers
    for i in range(8, 12):
        if is_not_digit(string[i]):
            return False
    return True
is_phone_number("382-384-3840")
is_phone_number("phone number")
The code above is unpleasant and verbose. Rather than manually loop through the characters of the string, we would prefer to specify a pattern and command Python to match the pattern.
Regular expressions (often abbreviated regex) conveniently solve this exact problem by allowing us to create general patterns for strings. Using a regular expression, we may re-implement the is_phone_number method in two short lines of Python:
import re
def is_phone_number(string):
    regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
    return re.search(regex, string) is not None
is_phone_number("382-384-3840")
In the code above, we use the regex [0-9]{3}-[0-9]{3}-[0-9]{4} to match phone numbers. Although cryptic at first glance, the syntax of regular expressions is fortunately much simpler to learn than the Python language itself; we introduce nearly all of the syntax in this section alone.
We will also introduce the built-in Python module re that performs string operations using regexes.
We start with the syntax of regular expressions. In Python, regular expressions are most commonly stored as raw strings. Raw strings behave like normal Python strings without special handling for backslashes.
For example, to store the string hello \ world in a normal Python string, we must write:
# Backslashes need to be escaped in normal Python strings
some_string = 'hello \\ world'
print(some_string)
Using a raw string removes the need to escape the backslash:
# Note the `r` prefix on the string
some_raw_string = r'hello \ world'
print(some_raw_string)
Since backslashes appear often in regular expressions, we will use raw strings for all regexes in this section.
A literal character in a regular expression matches the character itself. For example, the regex r"a" will match any "a" in "Say! I like green eggs and ham!". All alphanumeric characters and most punctuation characters are regex literals.
# HIDDEN
def show_regex_match(text, regex):
    """
    Prints the string with the regex match highlighted.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))
# The show_regex_match method highlights all regex matches in the input string
regex = r"green"
show_regex_match("Say! I like green eggs and ham!", regex)
show_regex_match("Say! I like green eggs and ham!", r"a")
In the example above we observe that regular expressions can match patterns that appear anywhere in the input string. In Python, this behavior differs depending on the method used to match the regex—some methods only return a match if the regex appears at the start of the string; some methods return a match anywhere in the string.
Notice also that the show_regex_match method highlights all occurrences of the regex in the input string. Again, this differs depending on the Python method used—some methods return all matches while some only return the first match.
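In Python's built-in re module, for example, re.match reports a match only at the start of the string, re.search finds the first match anywhere, and re.findall returns every match:

```python
import re

text = "Say! I like green eggs and ham!"

re.match(r"green", text)   # None: "green" is not at the start of the string
re.search(r"green", text)  # a Match object: "green" appears somewhere in the string
re.findall(r"a", text)     # every occurrence: ['a', 'a', 'a']
```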
Regular expressions are case-sensitive. In the example below, the regex only matches the lowercase s in eggs, not the uppercase S in Say.
show_regex_match("Say! I like green eggs and ham!", r"s")
Some characters have special meaning in a regular expression. These meta characters allow regexes to match a variety of patterns.
In a regular expression, the period character . matches any character except a newline.
show_regex_match("Call me at 382-384-3840.", r".all")
To match only the literal period character we must escape it with a backslash:
show_regex_match("Call me at 382-384-3840.", r"\.")
By using the period character to mark the parts of a pattern that vary, we construct a regex to match phone numbers. For example, we may take our original phone number 382-384-3840 and replace the numbers with ., leaving the dashes as literals. This results in the regex ...-...-.....
show_regex_match("Call me at 382-384-3840.", "...-...-....")
Since the period character matches all characters, however, the following input string will produce a spurious match.
show_regex_match("My truck is not-all-blue.", "...-...-....")
A character class matches a specified set of characters, allowing us to create more restrictive matches than the . character alone. To create a character class, wrap the set of desired characters in brackets [ ].
show_regex_match("I like your gray shirt.", "gr[ae]y")
show_regex_match("I like your grey shirt.", "gr[ae]y")
# Does not match; a character class only matches one character from a set
show_regex_match("I like your graey shirt.", "gr[ae]y")
# In this example, repeating the character class will match
show_regex_match("I like your graey shirt.", "gr[ae][ae]y")
In a character class, the . character is treated as a literal, not as a wildcard.
show_regex_match("I like your grey shirt.", "irt[.]")
There are a few special shorthand notations we can use for commonly used character classes:
| Shorthand | Meaning |
|---|---|
| [0-9] | All the digits |
| [a-z] | Lowercase letters |
| [A-Z] | Uppercase letters |
show_regex_match("I like your gray shirt.", "y[a-z]y")
Character classes allow us to create a more specific regex for phone numbers.
# We replaced every `.` character in ...-...-.... with [0-9] to restrict
# matches to digits.
phone_regex = r'[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
show_regex_match("Call me at 382-384-3840.", phone_regex)
# Now we no longer match this string:
show_regex_match("My truck is not-all-blue.", phone_regex)
A negated character class matches any character except the characters in the class. To create a negated character class, wrap the negated characters in [^ ].
show_regex_match("The car parked in the garage.", r"[^c]ar")
To create a regex to match phone numbers, we wrote:
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
This matches 3 digits, a dash, 3 more digits, a dash, and 4 more digits.
Quantifiers allow us to match multiple consecutive appearances of a pattern. We specify the number of repetitions by placing the number in curly braces { }.
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 382-384-3840.", phone_regex)
# No match
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 12-384-3840.", phone_regex)
A quantifier always modifies the character or character class to its immediate left. The following table shows the complete syntax for quantifiers.
| Quantifier | Meaning |
|---|---|
| {m,n} | Match the preceding character m to n times. |
| {m} | Match the preceding character exactly m times. |
| {m,} | Match the preceding character at least m times. |
| {,n} | Match the preceding character at most n times. |
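These quantifiers can be tried directly with Python's built-in re module (a quick sketch; the show_regex_match helper used elsewhere in this chapter is a course utility, so we use re.findall here):

```python
import re

# {2,4} matches between two and four consecutive 'a' characters
pattern = r"a{2,4}"

print(re.findall(pattern, "a aa aaa aaaaa"))
# ['aa', 'aaa', 'aaaa'] -- the lone 'a' is too short, and the greedy
# quantifier takes four of the five a's, leaving one unmatched
```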
Shorthand Quantifiers
Some commonly used quantifiers have a shorthand:
| Symbol | Quantifier | Meaning |
|---|---|---|
| * | {0,} | Match the preceding character 0 or more times |
| + | {1,} | Match the preceding character 1 or more times |
| ? | {0,1} | Match the preceding character 0 or 1 times |
We use the * character instead of {0,} in the following examples.
# 3 a's
show_regex_match('He screamed "Aaaah!" as the cart took a plunge.', "Aa*h!")
# Lots of a's
show_regex_match(
'He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.',
"Aa*h!"
)
# No lowercase a's
show_regex_match('He screamed "Ah!" as the cart took a plunge.', "Aa*h!")
Quantifiers are greedy
Quantifiers will always return the longest match possible. This sometimes results in surprising behavior:
# We tried to match 311 and 911 but matched the ` and ` as well because
# `<311> and <911>` is the longest match possible for `<.+>`.
show_regex_match("Remember the numbers <311> and <911>", "<.+>")
In many cases, using a more specific character class prevents these false matches:
show_regex_match("Remember the numbers <311> and <911>", "<[0-9]+>")
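Another option is to make a quantifier non-greedy by appending ? to it, so it matches as few characters as possible. A minimal sketch using re.findall:

```python
import re

text = "Remember the numbers <311> and <911>"

# Greedy: grabs everything between the first < and the last >
print(re.findall(r"<.+>", text))   # ['<311> and <911>']

# Non-greedy: stops at the first closing >
print(re.findall(r"<.+?>", text))  # ['<311>', '<911>']
```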
Sometimes a pattern should only match at the beginning or end of a string. The special character ^ anchors the regex to match only if the pattern appears at the beginning of the string; the special character $ anchors the regex to match only if the pattern occurs at the end of the string. For example, the regex well$ only matches an appearance of well at the end of the string.
show_regex_match('well, well, well', r"well$")
Using both ^ and $ requires the regex to match the full string.
phone_regex = r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
show_regex_match('382-384-3840', phone_regex)
# No match
show_regex_match('You can call me at 382-384-3840.', phone_regex)
All regex meta characters have special meaning in a regular expression. To match meta characters as literals, we escape them using the \ character.
# `[` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", r"\[")
# `.` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", r"\.")
We have now covered the most important pieces of regex syntax and meta characters. For a more complete reference, we include the tables below.
Meta Characters
This table includes most of the important meta characters, which help us specify certain patterns we want to match in a string.
| Char | Description | Example | Matches | Doesn't Match |
|---|---|---|---|---|
| . | Any character except \n | ... | abc | ab, abcd |
| [ ] | Any character inside brackets | [cb.]ar | car, .ar | jar |
| [^ ] | Any character not inside brackets | [^b]ar | car, par | bar, ar |
| * | 0 or more of last symbol | [pb]*ark | bbark, ark | dark |
| + | 1 or more of last symbol | [pb]+ark | bbpark, bark | dark, ark |
| ? | 0 or 1 of last symbol | s?he | she, he | the |
| {n} | Exactly n of last symbol | hello{3} | hellooo | hello |
| \| | Pattern before or after bar | we\|[ui]s | we, us, is | e, s |
| \ | Escapes next character | \[hi\] | [hi] | hi |
| ^ | Beginning of line | ^ark | ark two | dark |
| $ | End of line | ark$ | noahs ark | noahs arks |
Shorthand Character Sets
Some commonly used character sets have shorthands.
| Description | Bracket Form | Shorthand |
|---|---|---|
| Alphanumeric character | [a-zA-Z0-9] | \w |
| Not an alphanumeric character | [^a-zA-Z0-9] | \W |
| Digit | [0-9] | \d |
| Not a digit | [^0-9] | \D |
| Whitespace | [\t\n\f\r\p{Z}] | \s |
| Not whitespace | [^\t\n\f\r\p{Z}] | \S |
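These shorthands work in Python's re module as well, with two caveats: Python's \w also matches the underscore, and Python's \s matches plain whitespace characters rather than the \p{Z} Unicode property notation. A brief sketch:

```python
import re

text = "Call 382-384-3840 after 5pm"

# \d+ grabs each maximal run of digits
print(re.findall(r"\d+", text))    # ['382', '384', '3840', '5']

# \d{3} takes non-overlapping groups of exactly three digits
print(re.findall(r"\d{3}", text))  # ['382', '384', '384']

# \s matches whitespace, so this replaces each space
print(re.sub(r"\s", "_", text))    # Call_382-384-3840_after_5pm
```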
Almost all programming languages have a library to match patterns using regular expressions, making them useful regardless of the specific language. In this section, we introduced regex syntax and the most useful meta characters.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
In this section, we introduce regex usage in Python using the built-in re module. Since we only cover a few of the most commonly used methods, you will find it useful to consult the official documentation on the re module as well.
re.search

re.search(pattern, string) searches for a match of the regex pattern anywhere in string. It returns a truthy match object if the pattern is found; it returns None if not.
import re

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call me at 382-384-3840."
match = re.search(phone_re, text)
match
Although the returned match object has a variety of useful properties, we most commonly use re.search to test whether a pattern appears in a string.
if re.search(phone_re, text):
print("Found a match!")
if re.search(phone_re, 'Hello world'):
print("No match; this won't print")
Another commonly used method, re.match(pattern, string), behaves the same as re.search but only checks for a match at the start of string instead of a match anywhere in the string.
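The difference is easy to see with the familiar phone number regex:

```python
import re

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

# re.search finds the pattern anywhere in the string
print(bool(re.search(phone_re, "Call me at 382-384-3840.")))  # True

# re.match only succeeds if the pattern starts at index 0
print(bool(re.match(phone_re, "Call me at 382-384-3840.")))   # False
print(bool(re.match(phone_re, "382-384-3840 is my number")))  # True
```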
re.findall

We use re.findall(pattern, string) to extract substrings that match a regex. This method returns a list of all matches of pattern in string.
gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)
Using regex groups, we specify subpatterns to extract from a regex by wrapping the subpattern in parentheses ( ). When a regex contains regex groups, re.findall returns a list of tuples that contain the subpattern contents.
For example, the following familiar regex extracts phone numbers from a string:
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
To split apart the individual three or four digit components of a phone number, we can wrap each digit group in parentheses.
# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
As promised, re.findall returns a list of tuples containing the individual components of the matched phone numbers.
re.sub

re.sub(pattern, replacement, string) replaces all occurrences of pattern with replacement in the provided string. This method behaves like the Python string method str.replace but uses a regex to match patterns.
In the code below, we alter the dates to have a common format by substituting the date separators with a dash.
messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
re.sub(regex, '-', messy_dates)
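The replacement string can also refer back to regex groups with \1, \2, and so on, which lets us reorder the matched components. For example, converting dates to a year-first format (a sketch with made-up dates, assuming four-digit years):

```python
import re

dates = "03-12-2018, 03-14-2018"

# Wrap each date component in parentheses, then reorder the groups
# in the replacement string with backreferences
iso = re.sub(r"([0-9]{2})-([0-9]{2})-([0-9]{4})", r"\3-\1-\2", dates)
print(iso)  # 2018-03-12, 2018-03-14
```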
re.split

re.split(pattern, string) splits the input string each time the regex pattern appears. This method behaves like the Python string method str.split but uses a regex to make the split.
In the code below, we use re.split to split chapter names from their page numbers in a table of contents for a book.
toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''.strip()
# First, split into individual lines
lines = re.split('\n', toc)
lines
# Then, split into chapter title and page number
split_re = r'=+' # Matches any sequence of = characters
[re.split(split_re, line) for line in lines]
Recall that pandas Series objects have a .str property that supports string manipulation using Python string methods. Conveniently, the .str property also supports some functions from the re module. We demonstrate basic regex usage in pandas, leaving the complete method list to the pandas documentation on string methods.
We've stored the text of the first five sentences of the novel Little Women in the DataFrame below. We can use the string methods that pandas provides to extract the spoken dialog in each sentence.
# HIDDEN
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({
'sentences': text.split('\n')
})
little
Since spoken dialog lies within double quotation marks, we create a regex that captures a double quotation mark, a sequence of any characters except a double quotation mark, and the closing quotation mark.
quote_re = r'"[^"]+"'
little['sentences'].str.findall(quote_re)
Since the Series.str.findall method returns a list of matches, pandas also provides the Series.str.extract and Series.str.extractall methods to extract matches into a Series or DataFrame. These methods require the regex to contain at least one regex group.
# Extract text within double quotes
quote_re = r'"([^"]+)"'
spoken = little['sentences'].str.extract(quote_re)
spoken
We can add this series as a column of the little DataFrame:
little['dialog'] = spoken
little
We can confirm that our string manipulation behaves as expected for the last sentence in our DataFrame by printing the original and extracted text:
print(little.loc[4, 'sentences'])
print(little.loc[4, 'dialog'])
The re module in Python provides a useful group of methods for manipulating text using regular expressions. When working with DataFrames, we often use the analogous string manipulation methods implemented in pandas.
For the complete documentation on the re module, see https://docs.python.org/3/library/re.html
For the complete documentation on pandas string methods, see https://pandas.pydata.org/pandas-docs/stable/text.html
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
Thus far we have worked with datasets that are stored as text files on a computer. While useful for analysis of small datasets, using text files to store data presents challenges for many real-world use cases.
Many datasets are collected by multiple people—a team of data scientists, for example. If the data are stored in text files, however, the team will likely have to send and download new versions of the files each time the data are updated. Text files alone do not provide a consistent point of data retrieval for multiple analysts to use. This issue, among others, makes text files difficult to use for larger datasets or teams.
We often turn to relational database management systems (RDBMSs) to store data, such as MySQL or PostgreSQL. To work with these systems, we use a query language called SQL instead of Python. In this chapter, we discuss the relational database model and introduce SQL.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
A database is an organized collection of data. In the past, data was stored in specialized data structures that were designed for specific tasks. For example, airlines might record flight bookings in a different format than a bank managing an account ledger. In 1969, Ted Codd introduced the relational model as a general method of storing data. Data is stored in two-dimensional tables called relations, consisting of individual observations in each row (commonly referred to as tuples). Each tuple is a structured data item that represents the relationship between certain attributes (columns). Each attribute of a relation has a name and data type.
Consider the purchases relation below:
| name | product | retailer | date purchased |
|---|---|---|---|
| Samantha | iPod | Best Buy | June 3, 2016 |
| Timothy | Chromebook | Amazon | July 8, 2016 |
| Jason | Surface Pro | Target | October 2, 2016 |
In purchases, each tuple represents the relationship between the name, product, retailer, and date purchased attributes.
A relation's schema contains its column names, data types, and constraints. For example, the schema of the purchases table states that the columns are name, product, retailer, and date purchased; it also states that each column contains text.
The following prices relation shows the price of certain gadgets at a few retail stores:
| retailer | product | price |
|---|---|---|
| Best Buy | Galaxy S9 | 719.00 |
| Best Buy | iPod | 200.00 |
| Amazon | iPad | 450.00 |
| Amazon | Battery pack | 24.87 |
| Amazon | Chromebook | 249.99 |
| Target | iPod | 215.00 |
| Target | Surface Pro | 799.00 |
| Target | Google Pixel 2 | 659.00 |
| Walmart | Chromebook | 238.79 |
We can then reference both tables simultaneously to determine how much Samantha, Timothy, and Jason paid for their respective gadgets (assuming prices at each store stay constant over time). Together, the two tables form a relational database, which is a collection of one or more relations. The schema of the entire database is the set of schemas of the individual relations in the database.
A relational database can be simply described as a set of tables containing rows of individual data entries. A relational database management system (RDBMS) provides an interface to a relational database. Oracle, MySQL, and PostgreSQL are three of the most commonly used RDBMSs in practice today.
Relational database management systems give users the ability to add, edit, and remove data from databases. These systems provide several key benefits over using a collection of text files to store data, including the ability to enforce data integrity constraints, for example, that a GPA column only contains floats between 0.0 and 4.0.

To work with data stored in an RDBMS, we use the SQL programming language.
How do RDBMSs and the pandas Python package differ? First, pandas is not concerned about data storage. Although DataFrames can read and write from multiple data formats, pandas does not dictate how the data are actually stored on the underlying computer like an RDBMS does. Second, pandas primarily provides methods for manipulating data, while RDBMSs handle both data storage and data manipulation, making them more suitable for larger datasets. A typical rule of thumb is to use an RDBMS for datasets larger than several gigabytes. Finally, pandas requires knowledge of Python in order to use, whereas RDBMSs require knowledge of SQL. Since SQL is simpler to learn than Python, RDBMSs allow less technical users to store and query data, a handy trait.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
# Creating a table
sql_expr = """
CREATE TABLE prices(
retailer TEXT,
product TEXT,
price FLOAT);
"""
result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Inserting records into the table
sql_expr = """
INSERT INTO prices VALUES
('Best Buy', 'Galaxy S9', 719.00),
('Best Buy', 'iPod', 200.00),
('Amazon', 'iPad', 450.00),
('Amazon', 'Battery pack', 24.87),
('Amazon', 'Chromebook', 249.99),
('Target', 'iPod', 215.00),
('Target', 'Surface Pro', 799.00),
('Target', 'Google Pixel 2', 659.00),
('Walmart', 'Chromebook', 238.79);
"""
result = sqlite_engine.execute(sql_expr)
# HIDDEN
import pandas as pd
prices = pd.DataFrame([['Best Buy', 'Galaxy S9', 719.00],
['Best Buy', 'iPod', 200.00],
['Amazon', 'iPad', 450.00],
['Amazon', 'Battery pack', 24.87],
['Amazon', 'Chromebook', 249.99],
['Target', 'iPod', 215.00],
['Target', 'Surface Pro', 799.00],
['Target', 'Google Pixel 2', 659.00],
['Walmart', 'Chromebook', 238.79]],
columns=['retailer', 'product', 'price'])
SQL (Structured Query Language) is a programming language that has operations to define, logically organize, manipulate, and perform calculations on data stored in a relational database management system (RDBMS).
SQL is a declarative language. This means that the user only needs to specify what kind of data they want, not how to obtain it. An example is shown below, with an imperative example for comparison:
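As a self-contained sketch of this contrast (using the standard library's sqlite3 and a throwaway in-memory table, not this chapter's sqlite_engine): the SQL query states what rows we want, while the Python list comprehension spells out how to find them.

```python
import sqlite3

# A throwaway in-memory table with a subset of the prices data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (retailer TEXT, product TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)",
                 [("Best Buy", "iPod", 200.00),
                  ("Target", "Surface Pro", 799.00)])

# Declarative (SQL): state WHAT we want, products under $500
cheap_sql = conn.execute(
    "SELECT product FROM prices WHERE price < 500").fetchall()

# Imperative (Python): spell out HOW to find them, loop and filter
cheap_py = [product
            for retailer, product, price in conn.execute("SELECT * FROM prices")
            if price < 500]

print(cheap_sql)  # [('iPod',)]
print(cheap_py)   # ['iPod']
```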
In this chapter, we will write SQL queries as Python strings, then use pandas to execute the SQL query and read the result into a pandas DataFrame. As we walk through the basics of SQL syntax, we'll also occasionally show pandas equivalents for comparison purposes.
pandas

To execute SQL queries from Python, we will connect to a database using the sqlalchemy library. Then we can use the pandas function pd.read_sql to execute SQL queries through this connection.
import pandas as pd
import sqlalchemy
# pd.read_sql takes in a parameter for a SQLite engine, which we create below
sqlite_uri = "sqlite:///sql_basics.db"
sqlite_engine = sqlalchemy.create_engine(sqlite_uri)
This database contains one relation: prices. To display the relation we run a SQL query. Calling read_sql will execute the SQL query on the RDBMS, then return the results in a pandas DataFrame.
sql_expr = """
SELECT *
FROM prices
"""
pd.read_sql(sql_expr, sqlite_engine)
Later in this section we will compare SQL queries with pandas method calls, so we've created an identical DataFrame in pandas.
prices
All SQL queries take the general form below:
SELECT [DISTINCT] <column expression list>
FROM <relation>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number>]
Note that:

- Every SQL query must contain a SELECT and a FROM statement.
- FROM query blocks can reference one or more tables, although in this section we will only look at one table at a time for simplicity.

The two mandatory statements in a SQL query are:

- SELECT, which indicates the columns that we want to view.
- FROM, which indicates the tables from which we are selecting these columns.

To display the entire prices table, we run:
sql_expr = """
SELECT *
FROM prices
"""
pd.read_sql(sql_expr, sqlite_engine)
SELECT * returns every column in the original relation. To display only the retailers that are represented in prices, we add the retailer column to the SELECT statement.
sql_expr = """
SELECT retailer
FROM prices
"""
pd.read_sql(sql_expr, sqlite_engine)
If we want a list of unique retailers, we can use the DISTINCT keyword to omit repeated values.
sql_expr = """
SELECT DISTINCT(retailer)
FROM prices
"""
pd.read_sql(sql_expr, sqlite_engine)
This would be the functional equivalent of the following pandas code:
prices['retailer'].unique()
Each RDBMS comes with its own set of functions that can be applied to attributes in the SELECT list, such as comparison operators, mathematical functions and operators, and string functions and operators. In Data 100 we use PostgreSQL, a mature RDBMS that comes with hundreds of such functions. The complete list is available here. Keep in mind that each RDBMS has a different set of functions for use in SELECT.
The following code converts all retailer names to uppercase and halves the product prices.
sql_expr = """
SELECT
UPPER(retailer) AS retailer_caps,
product,
price / 2 AS half_price
FROM prices
"""
pd.read_sql(sql_expr, sqlite_engine)
Notice that we can alias the columns (assign another name) with AS so that the columns appear with this new name in the output table. This does not modify the names of the columns in the source relation.
The WHERE clause allows us to specify certain constraints for the returned data; these constraints are often referred to as predicates. For example, to retrieve only gadgets that are under $500:
sql_expr = """
SELECT *
FROM prices
WHERE price < 500
"""
pd.read_sql(sql_expr, sqlite_engine)
We can also use the operators AND, OR, and NOT to further constrain our SQL query. To find the items sold at Amazon, other than the battery pack, that cost under $300, we write:
sql_expr = """
SELECT *
FROM prices
WHERE retailer = 'Amazon'
AND NOT product = 'Battery pack'
AND price < 300
"""
pd.read_sql(sql_expr, sqlite_engine)
The equivalent operation in pandas is:
prices[(prices['retailer'] == 'Amazon')
& ~(prices['product'] == 'Battery pack')
& (prices['price'] < 300)]
There's a subtle difference that's worth noting: the index of the Chromebook in the SQL query is 0, whereas the corresponding index in the DataFrame is 4. This is because SQL queries always return a new table with indices counting up from 0, whereas pandas subsets a portion of the DataFrame prices and returns it with the original indices. We can use pd.DataFrame.reset_index to reset the indices in pandas.
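For instance (a sketch that rebuilds a few rows of the prices DataFrame so the snippet stands alone):

```python
import pandas as pd

# A few rows of the prices table, recreated inline
prices = pd.DataFrame(
    [["Best Buy", "Galaxy S9", 719.00],
     ["Best Buy", "iPod", 200.00],
     ["Amazon", "iPad", 450.00],
     ["Amazon", "Chromebook", 249.99]],
    columns=["retailer", "product", "price"])

# Subsetting keeps the original row labels...
subset = prices[prices["price"] < 500]
print(list(subset.index))  # [1, 2, 3]

# ...while reset_index(drop=True) renumbers from 0, like a SQL result
subset = subset.reset_index(drop=True)
print(list(subset.index))  # [0, 1, 2]
```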
So far, we've only worked with data from the existing rows in the table; that is, all of our returned tables have been some subset of the entries found in the table. But to conduct data analysis, we'll want to compute aggregate values over our data. In SQL, these are called aggregate functions.
If we want to find the average price of all gadgets in the prices relation:
sql_expr = """
SELECT AVG(price) AS avg_price
FROM prices
"""
pd.read_sql(sql_expr, sqlite_engine)
Equivalently, in pandas:
prices['price'].mean()
A complete list of PostgreSQL aggregate functions can be found here. Though we're using PostgreSQL as our primary version of SQL in Data 100, keep in mind that there are many other variations of SQL (MySQL, SQLite, etc.) that use different function names and have different functions available.
With aggregate functions, we can execute more complicated SQL queries. To operate on more granular aggregate data, we can use the following two clauses:
- GROUP BY takes a list of columns and groups the table, like the pd.DataFrame.groupby function in pandas.
- HAVING is functionally similar to WHERE, but is used exclusively to apply predicates to aggregated data. (Note that in order to use HAVING, it must be preceded by a GROUP BY clause.)

Important: When using GROUP BY, all columns in the SELECT clause must either be listed in the GROUP BY clause or have an aggregate function applied to them.
We can use these statements to find the maximum price at each retailer.
sql_expr = """
SELECT retailer, MAX(price) as max_price
FROM prices
GROUP BY retailer
"""
pd.read_sql(sql_expr, sqlite_engine)
Let's say we have a client with expensive taste and only want to find retailers that sell gadgets over $700. Note that we must use HAVING to define predicates on aggregated columns; we can't use WHERE to filter an aggregated column. To compute a list of retailers and accompanying prices that satisfy our needs, we run:
sql_expr = """
SELECT retailer, MAX(price) as max_price
FROM prices
GROUP BY retailer
HAVING max_price > 700
"""
pd.read_sql(sql_expr, sqlite_engine)
For comparison, we recreate the same table in pandas:
max_prices = prices.groupby('retailer').max()
max_prices.loc[max_prices['price'] > 700, ['price']]
These clauses allow us to control the presentation of the data:
- ORDER BY lets us present the data in sorted order of column values. By default, ORDER BY uses ascending order (ASC), but we can specify descending order using DESC.
- LIMIT controls how many tuples are displayed.

Let's display the three cheapest items in our prices table:
sql_expr = """
SELECT *
FROM prices
ORDER BY price ASC
LIMIT 3
"""
pd.read_sql(sql_expr, sqlite_engine)
Note that we didn't have to include the ASC keyword since ORDER BY returns data in ascending order by default.
For comparison, in pandas:
prices.sort_values('price').head(3)
(Again, we see that the indices are out of order in the pandas DataFrame. As before, pandas returns a view on our DataFrame prices, whereas SQL is displaying a new table each time that we execute a query.)
Clauses in a SQL query are executed in a specific order. Unfortunately, this order differs from the order that the clauses are written in a SQL query. From first executed to last:
1. FROM: one or more source tables
2. WHERE: apply selection qualifications (eliminate rows)
3. GROUP BY: form groups and aggregate
4. HAVING: eliminate groups
5. SELECT: select columns

Note on WHERE vs. HAVING: since the WHERE clause is processed before applying GROUP BY, the WHERE clause cannot make use of aggregated values. To define predicates based on aggregated values, we must use the HAVING clause.
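We can observe this ordering with a throwaway in-memory SQLite table (a sketch; PostgreSQL rejects the aggregate-in-WHERE query with a similar error):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (retailer TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [("Best Buy", 719.00), ("Best Buy", 200.00),
                  ("Amazon", 450.00), ("Target", 799.00)])

# HAVING filters groups after GROUP BY, so aggregates are allowed
rows = sorted(conn.execute("""
    SELECT retailer, MAX(price)
    FROM prices
    GROUP BY retailer
    HAVING MAX(price) > 700
""").fetchall())
print(rows)  # [('Best Buy', 719.0), ('Target', 799.0)]

# WHERE runs before grouping, so an aggregate there is an error
try:
    conn.execute("SELECT retailer FROM prices WHERE MAX(price) > 700")
    where_allows_aggregates = True
except sqlite3.OperationalError:
    where_allows_aggregates = False
print(where_allows_aggregates)  # False
```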
We have introduced SQL syntax and the most important SQL statements needed to conduct data analysis using a relational database management system.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
# Make names table
sql_expr = """
CREATE TABLE names(
cat_id INTEGER PRIMARY KEY,
name TEXT);
"""
result = sqlite_engine.execute(sql_expr)
# Populate names table
sql_expr = """
INSERT INTO names VALUES
(0, 'Apricot'),
(1, 'Boots'),
(2, 'Cally'),
(4, 'Eugene');
"""
result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Make colors table
sql_expr = """
CREATE TABLE colors(
cat_id INTEGER PRIMARY KEY,
color TEXT);
"""
result = sqlite_engine.execute(sql_expr)
# Populate colors table
sql_expr = """
INSERT INTO colors VALUES
(0, 'orange'),
(1, 'black'),
(2, 'calico'),
(3, 'white');
"""
result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Make ages table
sql_expr = """
CREATE TABLE ages(
cat_id INTEGER PRIMARY KEY,
age INT);
"""
result = sqlite_engine.execute(sql_expr)
# Populate ages table
sql_expr = """
INSERT INTO ages VALUES
(0, 4),
(1, 3),
(2, 9),
(4, 20);
"""
result = sqlite_engine.execute(sql_expr)
In pandas we use the pd.merge function to join two tables using matching values in their columns. For example:
pd.merge(table1, table2, on='common_column')
In this section, we introduce SQL joins. SQL joins are used to combine multiple tables in a relational database.
Suppose we are cat store owners with a database for the cats we have in our store. We have two different tables: names and colors. The names table contains the columns cat_id, a unique number assigned to each cat, and name, the name for the cat. The colors table contains the columns cat_id and color, the color of each cat.
Note that there are some missing rows from both tables - a row with cat_id 3 is missing from the names table, and a row with cat_id 4 is missing from the colors table.
| cat_id | name |
|---|---|
| 0 | Apricot |
| 1 | Boots |
| 2 | Cally |
| 4 | Eugene |
| cat_id | color |
|---|---|
| 0 | orange |
| 1 | black |
| 2 | calico |
| 3 | white |
To compute the color of the cat named Apricot, we have to use information in both tables. We can join the tables on the cat_id column, creating a new table with both name and color.
A join combines tables by matching values in their columns.
There are four main types of joins: inner joins, outer joins, left joins, and right joins. Although all four combine tables, each one treats non-matching values differently.
Definition: In an inner join, the final table only contains rows that have matching columns in both tables.
Example: We would like to join the names and colors tables together to match each cat with its color. Since both tables contain a cat_id column that is the unique identifier for a cat, we can use an inner join on the cat_id column.
SQL: To write an inner join in SQL we modify our FROM clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
INNER JOIN <TABLE_2>
ON <...>
For example:
SELECT *
FROM names AS N
INNER JOIN colors AS C
ON N.cat_id = C.cat_id;
|   | cat_id | name | cat_id | color |
|---|---|---|---|---|
| 0 | 0 | Apricot | 0 | orange |
| 1 | 1 | Boots | 1 | black |
| 2 | 2 | Cally | 2 | calico |
You may verify that each cat name is matched with its color. Notice that the cats with cat_id 3 and 4 are not present in our resulting table because the colors table doesn't have a row with cat_id 4 and the names table doesn't have a row with cat_id 3. In an inner join, if a row doesn't have a matching value in the other table, the row is not included in the final result.
Assuming we have a DataFrame called names and a DataFrame called colors, we can conduct an inner join in pandas by writing:
pd.merge(names, colors, how='inner', on='cat_id')
Definition: In a full join (sometimes called an outer join), all records from both tables are included in the joined table. If a row doesn't have a match in the other table, the missing values are filled in with NULL.
Example: As before, we join the names and colors tables together to match each cat with its color. This time, we want to keep all rows in either table even if there isn't a match.
SQL: To write an outer join in SQL we modify our FROM clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
FULL JOIN <TABLE_2>
ON <...>
For example:
SELECT name, color
FROM names N
FULL JOIN colors C
ON N.cat_id = C.cat_id;
| cat_id | name | color |
|---|---|---|
| 0 | Apricot | orange |
| 1 | Boots | black |
| 2 | Cally | calico |
| 3 | NULL | white |
| 4 | Eugene | NULL |
Notice that the final output contains the entries with cat_id 3 and 4. If a row does not have a match, it is still included in the final output and any missing values are filled in with NULL.
In pandas:
pd.merge(names, colors, how='outer', on='cat_id')
Definition: In a left join, all records from the left table are included in the joined table. If a row doesn't have a match in the right table, the missing values are filled in with NULL.

Example: As before, we join the names and colors tables together to match each cat with its color. This time, we want to keep all the cat names even if a cat doesn't have a matching color.
SQL: To write a left join in SQL we modify our FROM clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
LEFT JOIN <TABLE_2>
ON <...>
For example:
SELECT name, color
FROM names N
LEFT JOIN colors C
ON N.cat_id = C.cat_id;
| cat_id | name | color |
|---|---|---|
| 0 | Apricot | orange |
| 1 | Boots | black |
| 2 | Cally | calico |
| 4 | Eugene | NULL |
Notice that the final output includes all four cat names. Three of the cat_ids in the names relation had matching cat_ids in the colors table and one did not (Eugene). The cat name that did not have a matching color has NULL as its color.
In pandas:
pd.merge(names, colors, how='left', on='cat_id')
Definition: In a right join, all records from the right table are included in the joined table. If a row doesn't have a match in the left table, the missing values are filled in with NULL.

Example: As before, we join the names and colors tables together to match each cat with its color. This time, we want to keep all the cat colors even if a cat doesn't have a matching name.
SQL: To write a right join in SQL we modify our FROM clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
RIGHT JOIN <TABLE_2>
ON <...>
For example:
SELECT name, color
FROM names N
RIGHT JOIN colors C
ON N.cat_id = C.cat_id;
| cat_id | name | color |
|---|---|---|
| 0 | Apricot | orange |
| 1 | Boots | black |
| 2 | Cally | calico |
| 3 | NULL | white |
This time, observe that the final output includes all four cat colors. Three of the cat_ids in the colors relation had matching cat_ids in the names table and one did not (white). The cat color that did not have a matching name has NULL as its name.
You may also notice that a right join produces the same result as a left join with the table order swapped. That is, names left joined with colors is the same as colors right joined with names. Because of this, some SQL engines (such as SQLite) do not support right joins.
In pandas:
pd.merge(names, colors, how='right', on='cat_id')
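We can check this left/right symmetry directly in pandas. Using the same reconstructed names and colors DataFrames as before, a right join of names with colors contains exactly the same rows as a left join of colors with names; only the column order differs:

```python
import pandas as pd

# Rows reconstructed from the example outputs in this section.
names = pd.DataFrame({'cat_id': [0, 1, 2, 4],
                      'name': ['Apricot', 'Boots', 'Cally', 'Eugene']})
colors = pd.DataFrame({'cat_id': [0, 1, 2, 3],
                       'color': ['orange', 'black', 'calico', 'white']})

right = pd.merge(names, colors, how='right', on='cat_id')
left_swapped = pd.merge(colors, names, how='left', on='cat_id')

# After aligning column order, the two results are identical.
print(right.sort_index(axis=1).equals(left_swapped.sort_index(axis=1)))
```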
There are typically multiple ways to accomplish the same task in SQL just as there are multiple ways to accomplish the same task in Python. We point out one other method for writing an inner join that appears in practice called an implicit join. Recall that we previously wrote the following to conduct an inner join:
SELECT *
FROM names AS N
INNER JOIN colors AS C
ON N.cat_id = C.cat_id;
An implicit inner join has a slightly different syntax. Notice in particular that the FROM clause uses a comma to select from two tables and that the query includes a WHERE clause to specify the join condition.
SELECT *
FROM names AS N, colors AS C
WHERE N.cat_id = C.cat_id;
When multiple tables are specified in the FROM clause, SQL creates a table containing every combination of rows from each table. For example:
sql_expr = """
SELECT *
FROM names N, colors C
"""
pd.read_sql(sql_expr, sqlite_engine)
This operation is often called a Cartesian product: each row in the first table is paired with every row in the second table. Notice that many rows contain cat colors that are not matched properly with their names. The additional WHERE clause in the implicit join filters out rows that do not have matching cat_id values.
SELECT *
FROM names AS N, colors AS C
WHERE N.cat_id = C.cat_id;
| cat_id | name | cat_id | color |
|---|---|---|---|
| 0 | Apricot | 0 | orange |
| 1 | Boots | 1 | black |
| 2 | Cally | 2 | calico |
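A minimal sketch of the same idea in pandas, using the reconstructed example tables and `how='cross'` (available in pandas 1.2+): the Cartesian product has 4 × 4 = 16 rows, and filtering on matching cat_ids recovers the inner join just as the WHERE clause does:

```python
import pandas as pd

# Rows reconstructed from the example outputs in this section.
names = pd.DataFrame({'cat_id': [0, 1, 2, 4],
                      'name': ['Apricot', 'Boots', 'Cally', 'Eugene']})
colors = pd.DataFrame({'cat_id': [0, 1, 2, 3],
                       'color': ['orange', 'black', 'calico', 'white']})

# Cartesian product: every names row paired with every colors row.
product = pd.merge(names, colors, how='cross')  # pandas >= 1.2
print(len(product))  # 16

# Keep only the rows where the cat_ids match, like the WHERE clause does.
matched = product[product['cat_id_x'] == product['cat_id_y']]
print(len(matched))  # 3
```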
To join multiple tables, extend the FROM clause with additional JOIN operators. For example, the following table ages includes data about each cat's age.
| cat_id | age |
|---|---|
| 0 | 4 |
| 1 | 3 |
| 2 | 9 |
| 4 | 20 |
To conduct an inner join on the names, colors, and ages tables, we write:
# Joining three tables
sql_expr = """
SELECT name, color, age
FROM names n
INNER JOIN colors c ON n.cat_id = c.cat_id
INNER JOIN ages a ON n.cat_id = a.cat_id;
"""
pd.read_sql(sql_expr, sqlite_engine)
We have covered the four main types of SQL joins: inner, full, left, and right joins. We use all four joins to combine information in separate relations, and each join differs only in how it handles non-matching rows in the input tables.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/10'))
Essentially, all models are wrong, but some are useful. (George Box)
We have covered question formulation, data cleaning, and exploratory data analysis, the first three steps of the data science lifecycle. We have also seen that EDA often reveals relationships between variables in our dataset. How do we decide whether a relationship is real or spurious? How do we use these relationships to make reliable predictions about the future? To answer these questions we will need the mathematical tools for modeling and estimation.
A model is an idealized representation of a system. For example, if we drop a steel ball off the Leaning Tower of Pisa, a simple model of gravity states that we expect the ball to drop to the ground, accelerating at the rate of 9.8 m/s². This model also allows us to predict how long it will take the ball to hit the ground using the laws of projectile motion.
This model of gravity describes our system's behavior but is only an approximation—it leaves out the effects of air resistance, the gravitational effects of other celestial bodies, and the buoyancy of air. Because of these unconsidered factors, our model will almost always make incorrect predictions in real life! Still, the simple model of gravity is accurate enough in so many situations that it's widely used and taught today.
Similarly, any model that we define using data is an approximation of a real-world process. When the approximations are not too severe, our model has practical use. This naturally raises a few fundamental questions. How do we choose a model? How do we know whether we need a more complicated model?
In the remaining chapters of the book, we will develop computational tools to design and fit models to data. We will also introduce inferential tools that allow us to reason about our models' ability to generalize to the population of interest.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/10'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
In the United States, many diners will leave a tip for their waiter or waitress as they pay for the meal. Although it is customary in the US to offer 15% of the total bill as a tip, perhaps some restaurants have more generous patrons than others.
One particular waiter was so interested in how much tip he could expect to get that he collected information about all the tables he served during a month of employment.
# HIDDEN
tips = sns.load_dataset('tips')
tips
We can plot a histogram of the tip amounts:
# HIDDEN
sns.distplot(tips['tip'], bins=np.arange(0, 10.1, 0.25), rug=True)
plt.xlabel('Tip Amount in Dollars')
plt.ylabel('Proportion per Dollar');
There are already some interesting patterns in the data. For example, there is a clear mode at \$2 and most tips seem to be in multiples of \$0.50.
For now, we are most interested in the percent tip: the tip amount divided by the bill amount. We can create a column in our DataFrame for this variable and show its distribution.
# HIDDEN
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
sns.distplot(tips['pcttip'], rug=True)
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
It looks like one table left our waiter a tip of around 70%! However, most of the tip percentages are under 30%. Let's zoom into that part of the distribution.
# HIDDEN
sns.distplot(tips['pcttip'], bins=np.arange(30), rug=True)
plt.xlim(0, 30)
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
We can see that the distribution is roughly centered at 15% with another potential mode at 20%. Suppose our waiter is interested in predicting how much percent tip he will get from a given table. To address this question, we can create a model for how much tip the waiter will get.

One simple model is to ignore the data altogether and state that since the convention in the U.S. is to give a 15% tip, the waiter will always get 15% in tips from his tables. While extremely simple, we will use this model to define some variables that we'll use later on.

This model assumes that there is one true percentage tip that all tables, past and future, will give the waiter. This is the population parameter for the percent tip, which we will denote by $ \theta^* $.

After making this assumption, our model then says that our guess for $ \theta^* $ is $ 15 $. We will use $ \theta $ to represent our current guess.

In mathematical notation, our model states that:

$$ \theta = 15 $$
This model is clearly problematic—if the model were true, every table in our dataset should have given the waiter exactly 15% tip. Nonetheless, this model will make a reasonable guess for many scenarios. In fact, this model might be the most useful choice if we had no other information aside from the fact that the waiter is employed in the US.
Since our waiter collected data, however, we can use his history of his tips to create a model instead of picking 15% based on convention alone.
The distribution of tip percents from our dataset is replicated below for convenience.
# HIDDEN
sns.distplot(tips['pcttip'], bins=np.arange(30), rug=True)
plt.xlim(0, 30)
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
Let's suppose we are trying to compare two choices for $ \theta $: $ \theta = 10 $ and $ \theta = 15 $. We can mark both of these choices on our distribution:
# HIDDEN
sns.distplot(tips['pcttip'], bins=np.arange(30), rug=True)
plt.axvline(x=10, c='darkblue', linestyle='--', label=r'$ \theta = 10$')
plt.axvline(x=15, c='darkgreen', linestyle='--', label=r'$ \theta = 15$')
plt.legend()
plt.xlim(0, 30)
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
Intuitively, it looks like choosing $ \theta = 15 $ makes more sense than $ \theta = 10 $ given our dataset. Why is this? When we look at the points in our data, we can see that more points fall close to $ \theta = 15 $ than to $ \theta = 10 $.

Although it seems apparent that $ \theta = 15 $ is a better choice than $ \theta = 10 $, it is not so clear whether $ \theta = 15 $ is a better choice than $ \theta = 16 $. To make precise choices between different values of $ \theta $, we would like to assign each value of $ \theta $ a number that measures how "good" it is for our data. That is, we want a function that takes as input a value of $ \theta $ and the points in our dataset, outputting a single number that we will use to select the best value of $ \theta $ that we can.
We call this function a loss function.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/10'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
tips = sns.load_dataset('tips')
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
Recall our assumptions thus far: we assume that there is a single population tip percentage $ \theta^* $. Our model estimates this parameter; we use the variable $ \theta $ to denote our estimate. We would like to use the collected data on tips to determine the value that $ \theta $ should have.

To precisely decide which value of $ \theta $ is best, we define a loss function. A loss function is a mathematical function that takes in an estimate $ \theta $ and the points in our dataset $ y_1, y_2, \ldots, y_n $. It outputs a single number, the loss, that measures how well $ \theta $ fits our data. In mathematical notation, we want to create the function:

$$ L(\theta, y_1, y_2, \ldots, y_n) $$

By convention, the loss function outputs lower values for preferable values of $ \theta $ and larger values for worse values of $ \theta $. To fit our model, we select the value of $ \theta $ that produces a lower loss than all other choices of $ \theta $, that is, the $ \theta $ that minimizes the loss. We use the notation $ \hat{\theta} $ to denote the value of $ \theta $ that minimizes a specified loss function.

Consider once again two possible values of $ \theta $: $ \theta = 10 $ and $ \theta = 15 $.
# HIDDEN
sns.distplot(tips['pcttip'], bins=np.arange(30), rug=True)
plt.axvline(x=10, c='darkblue', linestyle='--', label=r'$ \theta = 10$')
plt.axvline(x=15, c='darkgreen', linestyle='--', label=r'$ \theta = 15$')
plt.legend()
plt.xlim(0, 30)
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
Since $ \theta = 15 $ falls closer to most of the points, our loss function should output a small value for $ \theta = 15 $ and a larger value for $ \theta = 10 $.
Let's use this intuition to create a loss function.
We would like our choice of $ \theta $ to fall close to the points in our dataset. Thus, we can define a loss function that outputs a larger value as $ \theta $ gets further away from the points in the dataset. We start with a simple loss function called the mean squared error. Here's the idea: for each value $ y_i $ in our dataset, we take the squared difference between $ y_i $ and our estimate, $ (y_i - \theta)^2 $, then average the squared differences across the whole dataset.

This gives us a final loss function of:

$$ L(\theta, y_1, y_2, \ldots, y_n) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 $$
Creating a Python function to compute the loss is simple:
def mse_loss(theta, y_vals):
return np.mean((y_vals - theta) ** 2)
Let's see how this loss function behaves. Suppose we have a dataset only containing one point, $ y_1 = 14 $. We can try different values of $ \theta $ and see what the loss function outputs for each value.
# HIDDEN
def try_thetas(thetas, y_vals, xlims, loss_fn=mse_loss, figsize=(10, 7), cols=3):
if not isinstance(y_vals, np.ndarray):
y_vals = np.array(y_vals)
rows = int(np.ceil(len(thetas) / cols))
plt.figure(figsize=figsize)
for i, theta in enumerate(thetas):
ax = plt.subplot(rows, cols, i + 1)
sns.rugplot(y_vals, height=0.1, ax=ax)
plt.axvline(theta, linestyle='--',
label=rf'$ \theta = {theta} $')
plt.title(f'Loss = {loss_fn(theta, y_vals):.2f}')
plt.xlim(*xlims)
plt.yticks([])
plt.legend()
plt.tight_layout()
try_thetas(thetas=[11, 12, 13, 14, 15, 16],
y_vals=[14], xlims=(10, 17))
You can also interactively try different values of $ \theta $ below. You should understand why the loss for $ \theta = 11 $ is many times higher than the loss for $ \theta = 13 $.
# HIDDEN
def try_thetas_interact(theta, y_vals, xlims, loss_fn=mse_loss):
if not isinstance(y_vals, np.ndarray):
y_vals = np.array(y_vals)
plt.figure(figsize=(4, 3))
sns.rugplot(y_vals, height=0.1)
plt.axvline(theta, linestyle='--')
plt.xlim(*xlims)
plt.yticks([])
print(f'Loss for theta = {theta}: {loss_fn(theta, y_vals):.2f}')
def mse_interact(theta, y_vals, xlims):
plot = interactive(try_thetas_interact, theta=theta,
y_vals=fixed(y_vals), xlims=fixed(xlims),
loss_fn=fixed(mse_loss))
plot.children[-1].layout.height = '240px'
return plot
mse_interact(theta=(11, 16, 0.5), y_vals=[14], xlims=(10, 17))
As we hoped, our loss is larger as $ \theta $ is further away from our data and is smallest when $ \theta $ falls exactly onto our data point. Let's now see how our mean squared error behaves when we have five points instead of one. Our data this time are: $ [11, 12, 15, 17, 18] $.
# HIDDEN
try_thetas(thetas=[12, 13, 14, 15, 16, 17],
y_vals=[11, 12, 15, 17, 18],
xlims=(10.5, 18.5))
Of the values of $ \theta $ we tried, $ \theta = 15 $ has the lowest loss. However, a value of $ \theta $ in between 14 and 15 might have an even lower loss than $ \theta = 15 $. See if you can find a better value of $ \theta $ using the interactive plot below.
# HIDDEN
mse_interact(theta=(12, 17, 0.2),
y_vals=[11, 12, 15, 17, 18],
xlims=(10.5, 18.5))
The mean squared error seems to be doing its job by penalizing values of $ \theta $ that are far away from the center of the data. Let's now see what the loss function outputs on the original dataset of tip percents. For reference, the original distribution of tip percents is plotted below:
# HIDDEN
sns.distplot(tips['pcttip'], bins=np.arange(30), rug=True)
plt.xlim(0, 30)
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
Let's try some values of $ \theta $.
# HIDDEN
try_thetas(thetas=np.arange(14.5, 17.1, 0.5),
y_vals=tips['pcttip'],
xlims=(0, 30))
As before, we've created an interactive widget to test different values of $ \theta $.
# HIDDEN
mse_interact(theta=(13, 17, 0.25),
y_vals=tips['pcttip'],
xlims=(0, 30))
It looks like the best value of $ \theta $ that we've tried so far is 16.00, slightly above our original guess of a 15% tip.
We have defined our first loss function, the mean squared error (MSE). It computes high loss for values of $ \theta $ that are further away from the center of the data. Mathematically, this loss function is defined as:

$$ L(\theta, y_1, y_2, \ldots, y_n) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 $$

The loss function will compute different losses whenever we change either $ \theta $ or $ y_1, y_2, \ldots, y_n $. We've seen this happen when we tried different values of $ \theta $ and when we added new data points (changing $ y_1, y_2, \ldots, y_n $).

As a shorthand, we can define the vector $ \mathbf{y} = [y_1, y_2, \ldots, y_n] $. Then, we can write MSE as:

$$ L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 $$

So far, we have found the best value of $ \theta $ by simply trying out a bunch of values and then picking the one with the least loss. Although this method works decently well, we can find a better method by using the properties of our loss function.

For the following example, we use a dataset containing five points: $ \mathbf{y} = [11, 12, 15, 17, 18] $.
# HIDDEN
try_thetas(thetas=[12, 13, 14, 15, 16, 17],
y_vals=[11, 12, 15, 17, 18],
xlims=(10.5, 18.5))
In the plots above, we've used integer $ \theta $ values in between 12 and 17. When we change $ \theta $, the loss seems to start high (at 14.2), decrease until $ \theta = 15 $, then increase again. We can see that the loss changes as $ \theta $ changes, so let's make a plot comparing the loss to $ \theta $ for each of the six $ \theta $s we've tried.
# HIDDEN
thetas = np.array([12, 13, 14, 15, 16, 17])
y_vals = np.array([11, 12, 15, 17, 18])
losses = [mse_loss(theta, y_vals) for theta in thetas]
plt.scatter(thetas, losses)
plt.title(r'Loss vs. $ \theta $ when $\bf{y}$$ = [11, 12, 15, 17, 18] $')
plt.xlabel(r'$ \theta $ Values')
plt.ylabel('Loss');
The scatter plot shows the downward, then upward trend that we noticed before. We can try more values of $ \theta $ to see a complete curve that shows how the loss changes as $ \theta $ changes.
# HIDDEN
thetas = np.arange(12, 17.1, 0.05)
y_vals = np.array([11, 12, 15, 17, 18])
losses = [mse_loss(theta, y_vals) for theta in thetas]
plt.plot(thetas, losses)
plt.title(r'Loss vs. $ \theta $ when $\bf{y}$$ = [11, 12, 15, 17, 18] $')
plt.xlabel(r'$ \theta $ Values')
plt.ylabel('Loss');
The plot above shows that in fact, $ \theta = 15 $ was not the best choice; a $ \theta $ between 14 and 15 would have gotten a lower loss. We can use calculus to find that minimizing value of $ \theta $ exactly. At the minimum loss, the derivative of the loss function with respect to $ \theta $ is 0.
First, we start with our loss function:

$$ L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 $$

Next, we plug in our points $ [11, 12, 15, 17, 18] $:

$$ L(\theta, \mathbf{y}) = \frac{1}{5} \big[ (11 - \theta)^2 + (12 - \theta)^2 + (15 - \theta)^2 + (17 - \theta)^2 + (18 - \theta)^2 \big] $$

To find the value of $ \theta $ that minimizes this function, we compute the derivative with respect to $ \theta $:

$$ \frac{\partial}{\partial \theta} L(\theta, \mathbf{y}) = -\frac{2}{5} \big[ (11 - \theta) + (12 - \theta) + (15 - \theta) + (17 - \theta) + (18 - \theta) \big] = -\frac{2}{5} (73 - 5\theta) $$

Then, we find the value of $ \theta $ where the derivative is zero:

$$ -\frac{2}{5} (73 - 5\theta) = 0 \quad \Longrightarrow \quad \theta = \frac{73}{5} = 14.6 $$

We've found the minimizing $ \theta $, and as expected, it is between 14 and 15. We denote the $ \theta $ that minimizes the loss $ \hat{\theta} $. Thus, for the dataset $ [11, 12, 15, 17, 18] $ and the MSE loss function:

$$ \hat{\theta} = 14.6 $$

If we happen to compute the mean of the data values, we notice a curious equivalence:

$$ \text{mean}(\mathbf{y}) = \frac{11 + 12 + 15 + 17 + 18}{5} = 14.6 = \hat{\theta} $$

As it turns out, the equivalence above is no mere coincidence; the average of the data values always produces $ \hat{\theta} $, the $ \theta $ that minimizes the MSE loss.

To show this, we take the derivative of our loss function once more. Instead of plugging in points, we leave the $ y_i $ terms intact to generalize to other datasets.

$$ \frac{\partial}{\partial \theta} L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} -2 (y_i - \theta) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \theta) $$

Since we did not substitute in specific values for $ y_i $, this equation can be used with any dataset with any number of points.

Now, we set the derivative equal to zero and solve for $ \theta $ to find the minimizing value of $ \theta $ as before:

$$ -\frac{2}{n} \sum_{i=1}^{n} (y_i - \theta) = 0 \quad \Longrightarrow \quad \sum_{i=1}^{n} y_i = n \theta \quad \Longrightarrow \quad \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i = \text{mean}(\mathbf{y}) $$

Lo and behold, we see that there is a single value of $ \theta $ that gives the least MSE no matter what the dataset is: for the mean squared error, $ \hat{\theta} $ is the mean of the dataset values.
We no longer have to test out different values of $ \theta $ as we did before. We can compute the mean tip percentage in one go:
np.mean(tips['pcttip'])
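As a quick numeric sanity check (a sketch, not part of the original analysis), we can confirm on the five-point example that a fine grid search over $ \theta $ lands on the mean:

```python
import numpy as np

def mse_loss(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

y_vals = np.array([11, 12, 15, 17, 18])

# Brute-force search over a fine grid of theta values.
thetas = np.arange(10, 20, 0.01)
losses = [mse_loss(t, y_vals) for t in thetas]
best_theta = thetas[np.argmin(losses)]

print(round(best_theta, 2))  # close to 14.6
print(np.mean(y_vals))       # 14.6 exactly
```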
# HIDDEN
sns.distplot(tips['pcttip'], bins=np.arange(30), rug=True)
plt.axvline(x=16.08, c='darkblue', linestyle='--', label=r'$ \hat \theta = 16.08$')
plt.legend()
plt.xlim(0, 30)
plt.title('Distribution of tip percent')
plt.xlabel('Percent Tip Amount')
plt.ylabel('Proportion per Percent');
We have introduced a constant model, a model that outputs the same number for all entries in the dataset.
A loss function measures how well a given value of $ \theta $ fits the data. In this section, we introduced the mean squared error loss function and showed that $ \hat{\theta} = \text{mean}(\mathbf{y}) $ for the constant model.

The steps we took in this section apply to many modeling scenarios:

1. Select a model.
2. Select a loss function.
3. Fit the model by minimizing the loss.
In this book, all of our modeling techniques expand upon one or more of these steps. We introduce new models (1), new loss functions (2), and new techniques for minimizing loss (3).
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/10'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
tips = sns.load_dataset('tips')
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
# HIDDEN
def mse_loss(theta, y_vals):
return np.mean((y_vals - theta) ** 2)
def abs_loss(theta, y_vals):
return np.mean(np.abs(y_vals - theta))
# HIDDEN
def compare_mse_abs(thetas, y_vals, xlims, figsize=(10, 7), cols=3):
if not isinstance(y_vals, np.ndarray):
y_vals = np.array(y_vals)
rows = int(np.ceil(len(thetas) / cols))
plt.figure(figsize=figsize)
for i, theta in enumerate(thetas):
ax = plt.subplot(rows, cols, i + 1)
sns.rugplot(y_vals, height=0.1, ax=ax)
plt.axvline(theta, linestyle='--',
label=rf'$ \theta = {theta} $')
plt.title(f'MSE = {mse_loss(theta, y_vals):.2f}\n'
f'MAE = {abs_loss(theta, y_vals):.2f}')
plt.xlim(*xlims)
plt.yticks([])
plt.legend()
plt.tight_layout()
To fit a model, we select a loss function and select the model parameters that minimize the loss. In the previous section, we introduced the mean squared error (MSE) loss function:

$$ L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 $$

We used a constant model that predicts the same number for all entries in the dataset. When we fit this model using the MSE loss, we found that $ \hat{\theta} = \text{mean}(\mathbf{y}) $. On the tips dataset, we found that a fitted constant model will predict $ 16.08\% $ since $ 16.08\% $ is the mean of the tip percents.
In this section, we introduce two new loss functions, the mean absolute error loss function and the Huber loss function.
Now, we will keep our model the same but switch to a different loss function: the mean absolute error (MAE). Instead of taking the squared difference between each point $ y_i $ and our prediction $ \theta $, this loss function takes the absolute difference:

$$ L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta| $$

To get a better sense of how the MSE and MAE compare, let's compare their losses on different datasets. First, we'll use our dataset of one point: $ \mathbf{y} = [14] $.
# HIDDEN
compare_mse_abs(thetas=[11, 12, 13, 14, 15, 16],
y_vals=[14], xlims=(10, 17))
We see that the MSE is usually higher than the MAE since the error is squared. Let's see what happens when we have five points, $ \mathbf{y} = [12.1, 12.8, 14.9, 16.3, 17.2] $:
# HIDDEN
compare_mse_abs(thetas=[12, 13, 14, 15, 16, 17],
y_vals=[12.1, 12.8, 14.9, 16.3, 17.2],
xlims=(11, 18))
Remember that the actual loss values themselves are not very interesting to us; they are only useful for comparing different values of $ \theta $. Once we choose a loss function, we will look for $ \hat{\theta} $, the $ \theta $ that produces the least loss. Thus, we are interested in whether the loss functions produce different $ \hat{\theta} $.

So far, the two loss functions seem to agree on $ \hat{\theta} $. If we look a bit closer, however, we will start to see some differences. We first take the losses and plot them against $ \theta $ for each of the six $ \theta $ values we tried.
# HIDDEN
thetas = np.array([12, 13, 14, 15, 16, 17])
y_vals = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
mse_losses = [mse_loss(theta, y_vals) for theta in thetas]
abs_losses = [abs_loss(theta, y_vals) for theta in thetas]
plt.scatter(thetas, mse_losses, label='MSE')
plt.scatter(thetas, abs_losses, label='MAE')
plt.title(r'Loss vs. $ \theta $ when $ \bf{y}$$= [ 12.1, 12.8, 14.9, 16.3, 17.2 ] $')
plt.xlabel(r'$ \theta $ Values')
plt.ylabel('Loss')
plt.legend();
Then, we compute more values of $ \theta $ so that the curve is smooth:
# HIDDEN
thetas = np.arange(12, 17.1, 0.05)
y_vals = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
mse_losses = [mse_loss(theta, y_vals) for theta in thetas]
abs_losses = [abs_loss(theta, y_vals) for theta in thetas]
plt.plot(thetas, mse_losses, label='MSE')
plt.plot(thetas, abs_losses, label='MAE')
plt.title(r'Loss vs. $ \theta $ when $ \bf{y}$$ = [ 12.1, 12.8, 14.9, 16.3, 17.2 ] $')
plt.xlabel(r'$ \theta $ Values')
plt.ylabel('Loss')
plt.legend();
Then, we zoom into the region between 1.5 and 5 on the y-axis to see the difference in minima more clearly. We've marked the minima with dotted lines.
# HIDDEN
thetas = np.arange(12, 17.1, 0.05)
y_vals = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
mse_losses = [mse_loss(theta, y_vals) for theta in thetas]
abs_losses = [abs_loss(theta, y_vals) for theta in thetas]
plt.figure(figsize=(7, 5))
plt.plot(thetas, mse_losses, label='MSE')
plt.plot(thetas, abs_losses, label='MAE')
plt.axvline(np.mean(y_vals), c=sns.color_palette()[0], linestyle='--',
alpha=0.7, label='Minimum MSE')
plt.axvline(np.median(y_vals), c=sns.color_palette()[1], linestyle='--',
alpha=0.7, label='Minimum MAE')
plt.title(r'Loss vs. $ \theta $ when $ \bf{y}$$ = [ 12.1, 12.8, 14.9, 16.3, 17.2 ] $')
plt.xlabel(r'$ \theta $ Values')
plt.ylabel('Loss')
plt.ylim(1.5, 5)
plt.legend()
plt.tight_layout();
We've found empirically that the MSE and MAE can produce different $ \hat{\theta} $ for the same dataset. A closer analysis reveals when they will differ and, more importantly, why they differ.

One difference that we can see in the plots of loss vs. $ \theta $ above lies in the shape of the loss curves. Plotting the MSE results in a parabolic curve resulting from the squared term in the loss function.

Plotting the MAE, on the other hand, results in what looks like a connected series of lines. This makes sense when we consider that the absolute value function is piecewise linear, so taking the average of many absolute value functions produces a piecewise linear function.

Since the MSE has a squared error term, it will be more sensitive to outliers. If $ \theta = 10 $ and a point lies at 110, that point's error term for the MSE will be $ (110 - 10)^2 = 10000 $, whereas for the MAE it will be $ |110 - 10| = 100 $. We can illustrate this by taking a set of three points, $ \mathbf{y} = [12, 13, 14] $, and plotting the loss vs. $ \theta $ curves for MSE and MAE.
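The arithmetic in the previous paragraph can be checked directly (a trivial sketch using the hypothetical point at 110):

```python
theta = 10
y_outlier = 110

mse_term = (y_outlier - theta) ** 2  # squared error term for this point
mae_term = abs(y_outlier - theta)    # absolute error term for this point

print(mse_term)  # 10000
print(mae_term)  # 100
```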
Use the slider below to move the third point further away from the rest of the data and observe what happens to the loss curves. (We've scaled the curves to keep both in view since the MSE has larger values than the MAE.)
# HIDDEN
def compare_mse_abs_curves(y3=14):
thetas = np.arange(11.5, 26.5, 0.1)
y_vals = np.array([12, 13, y3])
mse_losses = [mse_loss(theta, y_vals) for theta in thetas]
abs_losses = [abs_loss(theta, y_vals) for theta in thetas]
mse_abs_diff = min(mse_losses) - min(abs_losses)
mse_losses = [loss - mse_abs_diff for loss in mse_losses]
plt.figure(figsize=(9, 2))
ax = plt.subplot(121)
sns.rugplot(y_vals, height=0.3, ax=ax)
plt.xlim(11.5, 26.5)
plt.xlabel('Points')
ax = plt.subplot(122)
plt.plot(thetas, mse_losses, label='MSE')
plt.plot(thetas, abs_losses, label='MAE')
plt.xlim(11.5, 26.5)
plt.ylim(min(abs_losses) - 1, min(abs_losses) + 10)
plt.xlabel(r'$ \theta $')
plt.ylabel('Loss')
plt.legend()
# HIDDEN
interact(compare_mse_abs_curves, y3=(14, 25));
We've shown the curves for $ y_3 = 14 $ and $ y_3 = 25 $ below.
# HIDDEN
compare_mse_abs_curves(y3=14)
# HIDDEN
compare_mse_abs_curves(y3=25)
As we move the point further away from the rest of the data, the MSE curve moves with it. When $ y_3 = 14 $, both MSE and MAE have $ \hat{\theta} = 13 $. However, when $ y_3 = 25 $, the MSE loss produces $ \hat{\theta} \approx 16.7 $ while the MAE produces $ \hat{\theta} = 13 $, unchanged from before.
Now that we have a qualitative sense of how the MSE and MAE differ, we can minimize the MAE to make this difference more precise. As before, we will take the derivative of the loss function with respect to $ \theta $ and set it equal to zero.

This time, however, we have to deal with the fact that the absolute value function is not always differentiable. When $ y_i > \theta $, $ \frac{\partial}{\partial \theta} |y_i - \theta| = -1 $. When $ y_i < \theta $, $ \frac{\partial}{\partial \theta} |y_i - \theta| = 1 $. Although $ |y_i - \theta| $ is not technically differentiable at $ y_i = \theta $, we will set the derivative there to 0 so that the equations are easier to work with.

Recall that the equation for the MAE is:

$$ L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta| = \frac{1}{n} \Big( \sum_{y_i < \theta} (\theta - y_i) + \sum_{y_i = \theta} 0 + \sum_{y_i > \theta} (y_i - \theta) \Big) $$

In the line above, we've split up the summation into three separate summations: one that has one term for each $ y_i < \theta $, one for each $ y_i = \theta $, and one for each $ y_i > \theta $. Why make the summation seemingly more complicated? If we know that $ y_i < \theta $, we also know that $ y_i - \theta < 0 $ and thus $ |y_i - \theta| = \theta - y_i $. A similar logic holds for each term above to make taking the derivative much easier.

Now, we take the derivative with respect to $ \theta $ and set it equal to zero:

$$ \frac{\partial}{\partial \theta} L(\theta, \mathbf{y}) = \frac{1}{n} \Big( \sum_{y_i < \theta} 1 - \sum_{y_i > \theta} 1 \Big) = 0 \quad \Longrightarrow \quad \sum_{y_i < \theta} 1 = \sum_{y_i > \theta} 1 $$

What does the result above mean? On the left hand side, we have one term for each data point less than $ \theta $. On the right, we have one for each data point greater than $ \theta $. Then, in order to satisfy the equation we need to pick a value for $ \theta $ that has the same number of smaller and larger points. This is the definition of the median of a set of numbers. Thus, the minimizing value of $ \theta $ for the MAE is $ \hat{\theta} = \text{median}(\mathbf{y}) $.
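We can verify this numerically (a sketch under the same setup as our earlier grid searches): for an odd number of points, a fine grid search over the MAE lands on the median:

```python
import numpy as np

def abs_loss(theta, y_vals):
    return np.mean(np.abs(y_vals - theta))

y_vals = np.array([10, 11, 12, 14, 15])

# Brute-force search over a fine grid of theta values.
thetas = np.arange(8, 18, 0.01)
losses = [abs_loss(t, y_vals) for t in thetas]
best_theta = thetas[np.argmin(losses)]

print(round(best_theta, 2))  # close to 12.0
print(np.median(y_vals))     # 12.0
```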
When we have an odd number of points, the median is simply the middle point when the points are arranged in sorted order. We can see that in the example below with five points, the loss is minimized when $ \theta $ lies at the median:
# HIDDEN
def points_and_loss(y_vals, xlim, loss_fn=abs_loss):
thetas = np.arange(xlim[0], xlim[1] + 0.01, 0.05)
abs_losses = [loss_fn(theta, y_vals) for theta in thetas]
plt.figure(figsize=(9, 2))
ax = plt.subplot(121)
sns.rugplot(y_vals, height=0.3, ax=ax)
plt.xlim(*xlim)
plt.xlabel('Points')
ax = plt.subplot(122)
plt.plot(thetas, abs_losses)
plt.xlim(*xlim)
plt.xlabel(r'$ \theta $')
plt.ylabel('Loss')
points_and_loss(np.array([10, 11, 12, 14, 15]), (9, 16))
However, when we have an even number of points, the loss is minimized when $ \theta $ is any value in between the two central points.
# HIDDEN
points_and_loss(np.array([10, 11, 14, 15]), (9, 16))
This is not the case when we use the MSE:
# HIDDEN
points_and_loss(np.array([10, 11, 14, 15]), (9, 16), mse_loss)
Our investigation and the derivation above show that the MSE is easier to differentiate but is more sensitive to outliers than the MAE. For the MSE, , while for the MAE . Notice that the median is less affected by outliers than the mean. This phenomenon arises from our construction of the two loss functions.
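As a small illustration of this robustness (hypothetical numbers, not from the tips data), moving one point far from the others drags the mean with it but leaves the median untouched:

```python
import numpy as np

y = np.array([12, 13, 14])
y_with_outlier = np.array([12, 13, 120])

# The mean chases the outlier; the median does not.
print(np.mean(y), np.median(y))                            # 13.0 13.0
print(np.mean(y_with_outlier), np.median(y_with_outlier))  # about 48.3, and 13.0
```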
We have also seen that the MSE has a unique minimizing $\hat{\theta}$, whereas the MAE can have multiple minimizing values of $\hat{\theta}$ when there is an even number of data points.
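To see the robustness difference concretely, here is a small illustration (the outlier value 100 is our own choice, not from the text): adding one extreme point drags the mean far more than the median.

```python
import numpy as np

y = np.array([10, 11, 12, 14, 15])
y_outlier = np.append(y, 100)   # add a single extreme point

print(np.mean(y), np.mean(y_outlier))      # 12.4 vs 27.0: large shift
print(np.median(y), np.median(y_outlier))  # 12.0 vs 13.0: small shift
```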
A third loss function called the Huber loss combines both the MSE and MAE to create a loss function that is differentiable and robust to outliers. The Huber loss accomplishes this by behaving like the MSE function for values close to the minimum and switching to the absolute loss for values far from the minimum.
As usual, we create a loss function by taking the mean of the Huber losses for each point in our dataset.
Let's see what the Huber loss function outputs for a dataset of $\textbf{y} = [14]$ as we vary $\theta$:
# HIDDEN
def huber_loss(est, y_obs, alpha=1):
    d = np.abs(est - y_obs)
    return np.where(d < alpha,
                    (est - y_obs)**2 / 2.0,
                    alpha * (d - alpha / 2.0))

thetas = np.linspace(0, 50, 200)
loss = huber_loss(thetas, np.array([14]), alpha=5)
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(np.array([14]), -20, -5, colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig('huber_loss.pdf')
We can see that the Huber loss is smooth, unlike the MAE. The Huber loss also increases at a linear rate, unlike the quadratic rate of the mean squared loss.
The Huber loss does have a drawback, however. Notice that it transitions from the MSE to the MAE once $\theta$ gets far enough from the point. We can tweak this "far enough" to get different loss curves. For example, we can make it transition once $\theta$ is just one unit away from the observation:
# HIDDEN
loss = huber_loss(thetas, np.array([14]), alpha=1)
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(np.array([14]), -20, -5,colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig('huber_loss.pdf')
Or we can make it transition when $\theta$ is ten units away from the observation:
# HIDDEN
loss = huber_loss(thetas, np.array([14]), alpha=10)
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(np.array([14]), -20, -5,colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig('huber_loss.pdf')
This choice results in a different loss curve and can thus result in different values of $\hat{\theta}$. If we want to use the Huber loss function, we have the additional task of setting this transition point to a suitable value.
The Huber loss function is defined mathematically as follows:

$$L_\alpha(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases} \frac{1}{2} (y_i - \theta)^2 & | y_i - \theta | \le \alpha \\ \alpha ( |y_i - \theta| - \frac{\alpha}{2} ) & \text{otherwise} \end{cases}$$

It is more complex than the previous loss functions because it combines the MSE and MAE. The additional parameter $\alpha$ sets the point where the Huber loss transitions from the squared loss to the absolute loss.
Attempting to take the derivative of the Huber loss function is tedious and does not produce an elegant result like the MSE and MAE do. Instead, we can use a computational method called gradient descent to find the minimizing value of $\theta$.
In this section, we introduced two loss functions: the mean absolute error and the Huber loss. We showed that for a constant model fitted using the MAE, $\hat{\theta} = \text{median}(\textbf{y})$.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/11'))
In order to use a dataset for estimation and prediction, we need to precisely define our model and select a loss function. For example, in the tip percentage dataset, our model assumed that there was a single tip percentage that does not vary by table. Then, we decided to use the mean squared error loss function and found the model that minimized the loss function.
We also found that there are simple expressions that minimize the MSE and the mean absolute error loss functions: the mean and the median. However, as our models and loss functions become more complicated we will no longer be able to find useful algebraic expressions for the models that minimize the loss. For example, the Huber loss has useful properties but is difficult to differentiate by hand.
We can use the computer to address this issue using gradient descent, a computational method of minimizing loss functions.
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def points_and_loss(y_vals, xlim, loss_fn):
    thetas = np.arange(xlim[0], xlim[1] + 0.01, 0.05)
    losses = [loss_fn(theta, y_vals) for theta in thetas]
    plt.figure(figsize=(9, 2))
    ax = plt.subplot(121)
    sns.rugplot(y_vals, height=0.3, ax=ax)
    plt.xlim(*xlim)
    plt.title('Points')
    plt.xlabel('Tip Percent')
    ax = plt.subplot(122)
    plt.plot(thetas, losses)
    plt.xlim(*xlim)
    plt.title(loss_fn.__name__)
    plt.xlabel(r'$ \theta $')
    plt.ylabel('Loss')
Let us return to our constant model:

$$\hat{y} = \theta$$

We will use the mean squared error loss function:

$$L(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta)^2$$

For simplicity, we will use the dataset $\textbf{y} = [12, 13, 15, 16, 17]$. We know from our analytical approach in a previous chapter that the minimizing $\theta$ for the MSE is $\text{mean}(\textbf{y}) = 14.6$. Let's see whether we can find the same value by writing a program.
If we write the program well, we will be able to use the same program on any loss function in order to find the minimizing value of $\theta$, including the mathematically complicated Huber loss:

$$L_\alpha(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases} \frac{1}{2} (y_i - \theta)^2 & | y_i - \theta | \le \alpha \\ \alpha ( |y_i - \theta| - \frac{\alpha}{2} ) & \text{otherwise} \end{cases}$$
First, we create a rug plot of the data points. To the right of the rug plot we plot the MSE for different values of $\theta$.
# HIDDEN
pts = np.array([12, 13, 15, 16, 17])
points_and_loss(pts, (11, 18), mse)
How might we write a program to automatically find the minimizing value of $\theta$? The simplest method is to compute the loss for many values of $\theta$. Then, we can return the $\theta$ value that resulted in the least loss.
We define a function called simple_minimize that takes in a loss function, an array of data points, and an array of $\theta$ values to try.
def simple_minimize(loss_fn, dataset, thetas):
    '''
    Returns the value of theta in thetas that produces the least loss
    on a given dataset.
    '''
    losses = [loss_fn(theta, dataset) for theta in thetas]
    return thetas[np.argmin(losses)]
Then, we can define a function to compute the MSE and pass it into simple_minimize.
def mse(theta, dataset):
    return np.mean((dataset - theta) ** 2)

dataset = np.array([12, 13, 15, 16, 17])
thetas = np.arange(12, 18, 0.1)
simple_minimize(mse, dataset, thetas)
This is close to the expected value:
# Compute the minimizing theta using the analytical formula
np.mean(dataset)
Now, we can define a function to compute the Huber loss and plot the loss against $\theta$.
def huber_loss(theta, dataset, alpha=1):
    d = np.abs(theta - dataset)
    return np.mean(
        np.where(d < alpha,
                 (theta - dataset)**2 / 2.0,
                 alpha * (d - alpha / 2.0))
    )
# HIDDEN
points_and_loss(pts, (11, 18), huber_loss)
Although we can see that the minimizing value of $\theta$ should be close to 15, we do not have an analytical method of finding $\hat{\theta}$ directly for the Huber loss. Instead, we can use our simple_minimize function.
simple_minimize(huber_loss, dataset, thetas)
Now, we can return to our original dataset of tip percentages and find the best value of $\theta$ using the Huber loss.
tips = sns.load_dataset('tips')
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
tips.head()
# HIDDEN
points_and_loss(tips['pcttip'], (11, 20), huber_loss)
simple_minimize(huber_loss, tips['pcttip'], thetas)
We can see that using the Huber loss gives us $\hat{\theta} = 15.5$. We can now compare the minimizing values for the MSE, MAE, and Huber loss.
print(f" MSE: theta_hat = {tips['pcttip'].mean():.2f}")
print(f" MAE: theta_hat = {tips['pcttip'].median():.2f}")
print(f" Huber loss: theta_hat = 15.50")
We can see that the Huber loss is closer to the MAE since it is less affected by the outliers on the right side of the tip percentage distribution:
sns.distplot(tips['pcttip'], bins=50);
Although simple_minimize allows us to minimize loss functions, it has some flaws that make it unsuitable for general-purpose use. Its primary issue is that it only works with a predetermined set of $\theta$ values to test. For example, in the code snippet below (repeated from above), we had to manually define $\theta$ values between 12 and 18.
dataset = np.array([12, 13, 15, 16, 17])
thetas = np.arange(12, 18, 0.1)
simple_minimize(mse, dataset, thetas)
How did we know to examine the range between 12 and 18? We had to inspect the plot of the loss function manually and see that there was a minimum in that range. This process becomes impractical as we add extra complexity to our models. In addition, we manually specified a step size of 0.1 in the code above. If the optimal value of $\theta$ were 12.043, our simple_minimize function would return 12.0, the nearest multiple of 0.1 in the grid.
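We can demonstrate this rounding behavior directly. In the sketch below, the single-point dataset with true minimizer 12.043 is our own example; simple_minimize can only return one of the grid values:

```python
import numpy as np

def simple_minimize(loss_fn, dataset, thetas):
    # Grid search: return the candidate theta with the least loss
    losses = [loss_fn(theta, dataset) for theta in thetas]
    return thetas[np.argmin(losses)]

def mse(theta, dataset):
    return np.mean((dataset - theta) ** 2)

dataset = np.array([12.043])      # the optimal theta is exactly 12.043
thetas = np.arange(12, 18, 0.1)   # grid with step size 0.1
best = simple_minimize(mse, dataset, thetas)
print(best)                       # 12.0, the nearest grid value
```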
We can solve both of these issues at once by using a method called gradient descent.
# HIDDEN
tips = sns.load_dataset('tips')
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
# HIDDEN
def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def grad_mse(theta, y_vals):
    return -2 * np.mean(y_vals - theta)

def plot_loss(y_vals, xlim, loss_fn):
    thetas = np.arange(xlim[0], xlim[1] + 0.01, 0.05)
    losses = [loss_fn(theta, y_vals) for theta in thetas]
    plt.figure(figsize=(5, 3))
    plt.plot(thetas, losses, zorder=1)
    plt.xlim(*xlim)
    plt.title(loss_fn.__name__)
    plt.xlabel(r'$ \theta $')
    plt.ylabel('Loss')

def plot_theta_on_loss(y_vals, theta, loss_fn, **kwargs):
    loss = loss_fn(theta, y_vals)
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c=sns.xkcd_rgb['green'])
    plt.scatter([theta], [loss], **{**default_args, **kwargs})

def plot_tangent_on_loss(y_vals, theta, loss_fn, eps=1e-6):
    slope = ((loss_fn(theta + eps, y_vals) - loss_fn(theta - eps, y_vals))
             / (2 * eps))
    xs = np.arange(theta - 1, theta + 1, 0.05)
    ys = loss_fn(theta, y_vals) + slope * (xs - theta)
    plt.plot(xs, ys, zorder=3, c=sns.xkcd_rgb['green'], linestyle='--')
We are interested in creating a function that can minimize a loss function without forcing the user to predetermine which values of $\theta$ to try. In other words, while the simple_minimize function has the following signature:
simple_minimize(loss_fn, dataset, thetas)
We would like a function that has the following signature:
minimize(loss_fn, dataset)
This function needs to automatically find the minimizing $\theta$ value no matter how small or large it is. We will use a technique called gradient descent to implement this new minimize function.
As with loss functions, we will discuss the intuition for gradient descent first, then formalize our understanding with mathematics.
Since the minimize function is not given values of $\theta$ to try, we start by picking a $\theta$ anywhere we'd like. Then, we can iteratively improve the estimate of $\theta$. To improve an estimate of $\theta$, we look at the slope of the loss function at that choice of $\theta$.
For example, suppose we are using the MSE for the simple dataset $\textbf{y} = [12.1, 12.8, 14.9, 16.3, 17.2]$ and our current choice of $\theta$ is 12.
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_theta_on_loss(pts, 12, mse)
We'd like to choose a new value for $\theta$ that decreases the loss. To do this, we look at the slope of the loss function at $\theta = 12$:
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_tangent_on_loss(pts, 12, mse)
The slope is negative, which means that increasing $\theta$ will decrease the loss.
If, on the other hand, the slope of the loss function is positive:
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_tangent_on_loss(pts, 16.5, mse)
When the slope is positive, decreasing $\theta$ will decrease the loss.
The slope of the tangent line tells us which direction to move $\theta$ in order to decrease the loss. If the slope is negative, $\theta$ should move in the positive direction. If the slope is positive, $\theta$ should move in the negative direction. Mathematically, we write:

$$\theta^{(t+1)} = \theta^{(t)} - \frac{\partial}{\partial \theta} L(\theta^{(t)}, \textbf{y})$$

where $\theta^{(t)}$ is the current estimate and $\theta^{(t+1)}$ is the next estimate.
For the MSE, we have:

$$\frac{\partial}{\partial \theta} L(\theta, \textbf{y}) = -\frac{2}{n} \sum_{i=1}^n (y_i - \theta)$$

When $\theta^{(t)} = 12$, we can compute $-\frac{2}{5} \sum_{i=1}^5 (y_i - 12) = -5.32$. Thus, $\theta^{(t+1)} = 12 - (-5.32) = 17.32$.
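We can verify this arithmetic with a few lines of numpy:

```python
import numpy as np

y = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
theta = 12

slope = -2 * np.mean(y - theta)   # derivative of the MSE at theta
new_theta = theta - slope         # update without a learning rate

print(round(slope, 2), round(new_theta, 2))  # -5.32 and 17.32
```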
We've plotted the old value of $\theta$ as a green outlined circle and the new value as a filled-in circle on the loss curve below.
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_theta_on_loss(pts, 12, mse, c='none',
edgecolor=sns.xkcd_rgb['green'], linewidth=2)
plot_theta_on_loss(pts, 17.32, mse)
Although $\theta$ went in the right direction, it ended up as far away from the minimum as it started. We can remedy this by multiplying the slope by a small constant before subtracting it from $\theta$. Our final update formula is:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \frac{\partial}{\partial \theta} L(\theta^{(t)}, \textbf{y})$$

where $\alpha$ is a small constant called the learning rate. For example, if we set $\alpha = 0.3$, this is the new $\theta^{(t+1)}$:
# HIDDEN
def plot_one_gd_iter(y_vals, theta, loss_fn, grad_loss, alpha=0.3):
    new_theta = theta - alpha * grad_loss(theta, y_vals)
    plot_loss(pts, (11, 18), loss_fn)
    plot_theta_on_loss(pts, theta, loss_fn, c='none',
                       edgecolor=sns.xkcd_rgb['green'], linewidth=2)
    plot_theta_on_loss(pts, new_theta, loss_fn)
    print(f'old theta: {theta}')
    print(f'new theta: {new_theta}')
# HIDDEN
plot_one_gd_iter(pts, 12, mse, grad_mse)
Here are the values of $\theta$ for successive iterations of this process. Notice that $\theta$ changes more slowly as it gets closer to the minimum loss because the slope is also smaller.
# HIDDEN
plot_one_gd_iter(pts, 13.60, mse, grad_mse)
# HIDDEN
plot_one_gd_iter(pts, 14.24, mse, grad_mse)
# HIDDEN
plot_one_gd_iter(pts, 14.49, mse, grad_mse)
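A short loop reproduces these successive estimates (same dataset, $\alpha = 0.3$, starting from $\theta = 12$); the values approach the mean of the data, 14.66, with smaller steps each time:

```python
import numpy as np

y = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
alpha = 0.3
theta = 12.0

for _ in range(4):
    slope = -2 * np.mean(y - theta)   # MSE derivative at the current theta
    theta = theta - alpha * slope
    print(round(theta, 2))            # theta creeps toward 14.66
```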
We now have the full algorithm for gradient descent:
1. Choose a starting value of $\theta$ (0 is a common choice).
2. Compute $\theta - \alpha \cdot \frac{\partial}{\partial \theta} L(\theta, \textbf{y})$ and store this as the new value of $\theta$.
3. Repeat until $\theta$ doesn't change between iterations.
You will more commonly see the gradient $\nabla_\theta L(\theta, \textbf{y})$ in place of the partial derivative $\frac{\partial}{\partial \theta} L(\theta, \textbf{y})$. The two notations are essentially equivalent, but since the gradient notation is more common we will use it in the gradient update formula from now on:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_\theta L(\theta^{(t)}, \textbf{y})$$

To review notation:
- $\theta^{(t)}$ is the current estimate of $\hat{\theta}$ at iteration $t$.
- $\theta^{(t+1)}$ is the next estimate.
- $\alpha$ is the learning rate, usually a small constant.
- $\nabla_\theta L(\theta^{(t)}, \textbf{y})$ is the gradient of the loss function at the current estimate.
You can now see the importance of choosing a differentiable loss function: $\nabla_\theta L(\theta, \textbf{y})$ is a crucial part of the gradient descent algorithm. (While it is possible to estimate the gradient numerically by computing the difference in loss for two slightly different values of $\theta$ and dividing by the distance between them, this typically increases the runtime of gradient descent so significantly that it becomes impractical to use.)
The gradient descent algorithm is simple yet powerful since we can use it for many types of models and many types of loss functions. It is the computational tool of choice for fitting many important models, including linear regression on large datasets and neural networks.
Now we return to our original task: defining the minimize function. We will have to change our function signature slightly since we now need to compute the gradient of the loss function.
def minimize(loss_fn, grad_loss_fn, dataset, alpha=0.2, progress=True):
    '''
    Uses gradient descent to minimize loss_fn. Returns the minimizing value
    of theta once theta changes less than 0.001 between iterations.
    '''
    theta = 0
    while True:
        if progress:
            print(f'theta: {theta:.2f} | loss: {loss_fn(theta, dataset):.2f}')
        gradient = grad_loss_fn(theta, dataset)
        new_theta = theta - alpha * gradient
        if abs(new_theta - theta) < 0.001:
            return new_theta
        theta = new_theta
Then we can define functions to compute our MSE and its gradient:
def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def grad_mse(theta, y_vals):
    return -2 * np.mean(y_vals - theta)
Finally, we can use the minimize function to compute the minimizing value of $\theta$ for $\textbf{y} = [12.1, 12.8, 14.9, 16.3, 17.2]$.
%%time
theta = minimize(mse, grad_mse, np.array([12.1, 12.8, 14.9, 16.3, 17.2]))
print(f'Minimizing theta: {theta}')
print()
We can see that gradient descent quickly finds the same solution as the analytic method:
np.mean([12.1, 12.8, 14.9, 16.3, 17.2])
Now, we can apply gradient descent to minimize the Huber loss on our dataset of tip percentages.
The Huber loss is:

$$L_\delta(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases} \frac{1}{2} (y_i - \theta)^2 & | y_i - \theta | \le \delta \\ \delta ( |y_i - \theta| - \frac{\delta}{2} ) & \text{otherwise} \end{cases}$$

The gradient of the Huber loss is:

$$\nabla_\theta L_\delta(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases} -(y_i - \theta) & | y_i - \theta | \le \delta \\ - \delta \cdot \text{sign} (y_i - \theta) & \text{otherwise} \end{cases}$$

(Note that in previous definitions of the Huber loss we used the variable $\alpha$ to denote the transition point. To avoid confusion with the learning rate $\alpha$ used in gradient descent, we replace the transition point parameter of the Huber loss with $\delta$.)
def huber_loss(theta, dataset, delta=1):
    d = np.abs(theta - dataset)
    return np.mean(
        np.where(d <= delta,
                 (theta - dataset)**2 / 2.0,
                 delta * (d - delta / 2.0))
    )

def grad_huber_loss(theta, dataset, delta=1):
    d = np.abs(theta - dataset)
    return np.mean(
        np.where(d <= delta,
                 -(dataset - theta),
                 -delta * np.sign(dataset - theta))
    )
Let's minimize the Huber loss on the tips dataset:
%%time
theta = minimize(huber_loss, grad_huber_loss, tips['pcttip'], progress=False)
print(f'Minimizing theta: {theta}')
print()
Gradient descent gives us a generic way to minimize a loss function when we cannot solve for the minimizing value of $\theta$ analytically. As our models and loss functions increase in complexity, we will turn to gradient descent as our tool of choice to fit models.
# HIDDEN
tips = sns.load_dataset('tips')
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
# HIDDEN
def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def abs_loss(theta, y_vals):
    return np.mean(np.abs(y_vals - theta))

def quartic_loss(theta, y_vals):
    return np.mean(1/5000 * (y_vals - theta + 12) * (y_vals - theta + 23)
                   * (y_vals - theta - 14) * (y_vals - theta - 15) + 7)

def grad_quartic_loss(theta, y_vals):
    return -1/2500 * (2 * (y_vals - theta)**3 + 9 * (y_vals - theta)**2
                      - 529 * (y_vals - theta) - 327)

def plot_loss(y_vals, xlim, loss_fn):
    thetas = np.arange(xlim[0], xlim[1] + 0.01, 0.05)
    losses = [loss_fn(theta, y_vals) for theta in thetas]
    plt.figure(figsize=(5, 3))
    plt.plot(thetas, losses, zorder=1)
    plt.xlim(*xlim)
    plt.title(loss_fn.__name__)
    plt.xlabel(r'$ \theta $')
    plt.ylabel('Loss')

def plot_theta_on_loss(y_vals, theta, loss_fn, **kwargs):
    loss = loss_fn(theta, y_vals)
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c=sns.xkcd_rgb['green'])
    plt.scatter([theta], [loss], **{**default_args, **kwargs})

def plot_connected_thetas(y_vals, theta_1, theta_2, loss_fn, **kwargs):
    plot_theta_on_loss(y_vals, theta_1, loss_fn)
    plot_theta_on_loss(y_vals, theta_2, loss_fn)
    loss_1 = loss_fn(theta_1, y_vals)
    loss_2 = loss_fn(theta_2, y_vals)
    plt.plot([theta_1, theta_2], [loss_1, loss_2])
# HIDDEN
def plot_one_gd_iter(y_vals, theta, loss_fn, grad_loss, alpha=2.5):
    new_theta = theta - alpha * grad_loss(theta, y_vals)
    plot_loss(pts, (-23, 25), loss_fn)
    plot_theta_on_loss(pts, theta, loss_fn, c='none',
                       edgecolor=sns.xkcd_rgb['green'], linewidth=2)
    plot_theta_on_loss(pts, new_theta, loss_fn)
    print(f'old theta: {theta}')
    print(f'new theta: {new_theta[0]}')
Gradient descent provides a general method for minimizing a function. As we observed for the Huber loss, gradient descent is especially useful when the function's minimum is difficult to find analytically.
Unfortunately, gradient descent does not always find the globally minimizing $\hat{\theta}$. Consider the following gradient descent run, starting from an initial $\theta = -21$ on the loss function below.
# HIDDEN
pts = np.array([0])
plot_loss(pts, (-23, 25), quartic_loss)
plot_theta_on_loss(pts, -21, quartic_loss)
# HIDDEN
plot_one_gd_iter(pts, -21, quartic_loss, grad_quartic_loss)
# HIDDEN
plot_one_gd_iter(pts, -9.9, quartic_loss, grad_quartic_loss)
# HIDDEN
plot_one_gd_iter(pts, -12.6, quartic_loss, grad_quartic_loss)
# HIDDEN
plot_one_gd_iter(pts, -14.2, quartic_loss, grad_quartic_loss)
On this loss function and initial $\theta$ value, gradient descent converges to $\theta \approx -14.5$, producing a loss of roughly 7. However, the global minimum for this loss function is at $\theta \approx 18$, corresponding to a loss close to zero. From this example, we observe that gradient descent finds a local minimum which may not necessarily have the same loss as the global minimum.
Luckily, a number of useful loss functions have identical local and global minima. Consider the familiar mean squared error loss function, for example:
# HIDDEN
pts = np.array([-2, -1, 1])
plot_loss(pts, (-5, 5), mse)
Running gradient descent on this loss function with an appropriate learning rate will always find the globally optimal $\hat{\theta}$, since the sole local minimum is also the global minimum.
The mean absolute error sometimes has multiple local minima. However, all the local minima produce the globally lowest loss possible.
# HIDDEN
pts = np.array([-1, 1])
plot_loss(pts, (-5, 5), abs_loss)
On this loss function, gradient descent will converge to one of the local minima in the range $[-1, 1]$. Since all of these local minima have the lowest loss possible for this function, gradient descent will still return an optimal choice of $\hat{\theta}$.
For some functions, any local minimum is also a global minimum. Such functions are called convex functions because they curve upward. For a constant model, the MSE, MAE, and Huber loss are all convex.
With an appropriate learning rate, gradient descent finds the globally optimal $\hat{\theta}$ for convex loss functions. Because of this useful property, we prefer to fit our models using convex loss functions unless we have a good reason not to.
Formally, a function $f$ is convex if and only if it satisfies the following inequality for all possible function inputs $a$ and $b$ and for all $t \in [0, 1]$:

$$t f(a) + (1 - t) f(b) \ge f(ta + (1 - t)b)$$

This inequality states that all lines connecting two points of the function must reside on or above the function itself. For the loss function at the start of this section, we can easily find such a line that appears below the graph:
# HIDDEN
pts = np.array([0])
plot_loss(pts, (-23, 25), quartic_loss)
plot_connected_thetas(pts, -12, 12, quartic_loss)
Thus, this loss function is non-convex.
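We can also test the convexity inequality numerically. Using $a = -12$, $b = 12$, and $t = 0.5$ (the same pair of points connected in the plot), the inequality fails for the quartic loss:

```python
import numpy as np

def quartic_loss(theta, y_vals):
    return np.mean(1/5000 * (y_vals - theta + 12) * (y_vals - theta + 23)
                   * (y_vals - theta - 14) * (y_vals - theta - 15) + 7)

y = np.array([0])
a, b, t = -12.0, 12.0, 0.5

lhs = t * quartic_loss(a, y) + (1 - t) * quartic_loss(b, y)  # chord at the midpoint
rhs = quartic_loss(t * a + (1 - t) * b, y)                   # function at the midpoint
print(lhs, rhs)  # lhs < rhs: the chord dips below the curve, so not convex
```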
For MSE, all lines connecting two points of the graph appear above the graph. We plot one such line below.
# HIDDEN
pts = np.array([0])
plot_loss(pts, (-23, 25), mse)
plot_connected_thetas(pts, -12, 12, mse)
The mathematical definition of convexity gives us a precise way of determining whether a function is convex. In this textbook, we will omit mathematical proofs of convexity and will instead state whether a chosen loss function is convex.
For a convex function, any local minimum is also a global minimum. This useful property allows gradient descent to efficiently find the globally optimal model parameters for a given loss function. While gradient descent will converge to a local minimum for non-convex loss functions, these local minima are not guaranteed to be globally optimal.
In this section, we discuss a modification to gradient descent that makes it much more useful for large datasets. The modified algorithm is called stochastic gradient descent.
Recall that gradient descent updates our model parameter $\theta$ using the gradient of our chosen loss function. Specifically, we used this gradient update formula:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_\theta L(\theta^{(t)}, \textbf{y})$$

In this equation:
- $\theta^{(t)}$ is our current estimate of $\hat{\theta}$ at iteration $t$.
- $\alpha$ is the learning rate.
- $\nabla_\theta L(\theta^{(t)}, \textbf{y})$ is the gradient of the loss function.
- $\theta^{(t+1)}$ is the next estimate of $\hat{\theta}$.
In the expression above, we calculate $\nabla_\theta L(\theta, \textbf{y})$ as the average gradient of the loss function across the entire dataset. In other words, each time we update $\theta$ we consult all the other points in our dataset as a complete batch. For this reason, the gradient update rule above is often referred to as batch gradient descent.
Unfortunately, we often work with large datasets. Although batch gradient descent will often find an optimal $\hat{\theta}$ in relatively few iterations, each iteration takes a long time to compute if the training set contains many points.
To circumvent the difficulty of computing a gradient across the entire training set, stochastic gradient descent approximates the overall gradient using a single randomly chosen data point. Since the observation is chosen randomly, we expect that following the gradient at individual observations will, on average, move us in the right direction and eventually converge to the same parameters as batch gradient descent.
Consider once again the formula for batch gradient descent:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_\theta L(\theta^{(t)}, \textbf{y})$$

In this formula, we have the term $\nabla_\theta L(\theta^{(t)}, \textbf{y})$, the average gradient of the loss function across all points in the training set. That is:

$$\nabla_\theta L(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta \ell(\theta, y_i)$$

where $\ell(\theta, y_i)$ is the loss at a single point in the training set. To conduct stochastic gradient descent, we simply replace the average gradient with the gradient at a single point. The gradient update formula for stochastic gradient descent is:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_\theta \ell(\theta^{(t)}, y_i)$$

In this formula, $y_i$ is chosen randomly from the dataset. Note that choosing the points randomly is critical to the success of stochastic gradient descent! If the points are not chosen randomly, stochastic gradient descent may produce significantly worse results than batch gradient descent.
We most commonly run stochastic gradient descent by shuffling the data points and using each one in its shuffled order until we complete one full pass through the training data. If the algorithm hasn't converged, we reshuffle the points and run another pass through the data. Each iteration of stochastic gradient descent looks at one data point; each complete pass through the data is called an epoch.
As an example, we derive the stochastic gradient descent update formula for the mean squared loss. Recall the definition of the mean squared loss:

$$L(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta)^2$$

Taking the gradient with respect to $\theta$, we have:

$$\nabla_\theta L(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n -2 (y_i - \theta)$$

Since the above equation gives us the average gradient across all points in the dataset, the gradient at a single point is simply the piece of the equation being averaged:

$$\nabla_\theta \ell(\theta, y_i) = -2 (y_i - \theta)$$

Thus, the batch gradient update rule for the MSE is:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \left( \frac{1}{n} \sum_{i=1}^n -2 (y_i - \theta^{(t)}) \right)$$

And the stochastic gradient update rule is:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \left( -2 (y_i - \theta^{(t)}) \right)$$
Since stochastic gradient descent only examines a single data point at a time, its updates will likely be less accurate than updates from batch gradient descent. However, since it computes each update much faster, stochastic gradient descent can make significant progress towards the optimal $\hat{\theta}$ by the time batch gradient descent finishes a single update.
In the image below, we show successive updates to $\theta$ using batch gradient descent. The darkest area of the plot corresponds to the optimal value of $\theta$ on our training data, $\hat{\theta}$.
(This image technically shows a model that has two parameters, but it is more important to see that batch gradient descent always takes a step towards $\hat{\theta}$.)
[Figure: batch gradient descent steps moving directly toward $\hat{\theta}$.]
Stochastic gradient descent, on the other hand, often takes steps away from $\hat{\theta}$! However, since it makes updates much more often, it often converges faster than batch gradient descent.
[Figure: stochastic gradient descent steps taking a noisier path toward $\hat{\theta}$.]
As we previously did for batch gradient descent, we define a function that performs stochastic gradient descent on the loss function. It will be similar to our minimize function, but we will need to implement the random selection of one observation at each iteration.
def minimize_sgd(loss_fn, grad_loss_fn, dataset, alpha=0.2):
    """
    Uses stochastic gradient descent to minimize loss_fn.
    Returns the minimizing value of theta once theta changes
    less than 0.001 between iterations.
    """
    NUM_OBS = len(dataset)
    theta = 0
    np.random.shuffle(dataset)
    while True:
        for i in range(0, NUM_OBS, 1):
            rand_obs = dataset[i]
            gradient = grad_loss_fn(theta, rand_obs)
            new_theta = theta - alpha * gradient
            if abs(new_theta - theta) < 0.001:
                return new_theta
            theta = new_theta
        np.random.shuffle(dataset)
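Here is a compact, self-contained run of the same idea (the synthetic dataset, seed, learning rate, and epoch count are all our own choices for illustration): iterating single-point squared-loss updates drives $\theta$ toward the mean, the MSE minimizer.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=15, scale=2, size=1000)   # synthetic observations

def grad_single(theta, y_i):
    # Gradient of the squared loss at one observation
    return -2 * (y_i - theta)

theta, alpha = 0.0, 0.05
for _ in range(3):                    # three epochs
    for y_i in rng.permutation(y):    # fresh shuffle each epoch
        theta -= alpha * grad_single(theta, y_i)

print(theta, np.mean(y))              # theta ends up near the mean
```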
Mini-batch gradient descent strikes a balance between batch gradient descent and stochastic gradient descent by increasing the number of observations that we select at each iteration. In mini-batch gradient descent, we use a few data points for each gradient update instead of a single point.
We use the average of the gradients of their losses to construct an estimate of the true gradient of the loss function. If $\mathcal{B}$ is the mini-batch of data points that we randomly sample from the $n$ observations, the following approximation holds:

$$\nabla_\theta L(\theta, \textbf{y}) \approx \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta \ell(\theta, y_i)$$
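The quality of this approximation is easy to check. In the sketch below (synthetic data and sizes chosen for illustration), the gradient of the MSE over a random mini-batch of 100 points is close to the full-batch gradient over 10,000 points:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=15, scale=2, size=10_000)
theta = 10.0

full_grad = -2 * np.mean(y - theta)             # full-batch gradient of the MSE
batch = rng.choice(y, size=100, replace=False)  # random mini-batch
mini_grad = -2 * np.mean(batch - theta)         # mini-batch estimate

print(full_grad, mini_grad)   # the two values should be close
```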
As with stochastic gradient descent, we perform mini-batch gradient descent by shuffling our training data and selecting mini-batches by iterating through the shuffled data. After each epoch, we re-shuffle our data and select new mini-batches.
While we have made the distinction between stochastic and mini-batch gradient descent in this textbook, stochastic gradient descent is sometimes used as an umbrella term that encompasses the selection of a mini-batch of any size.
Mini-batch gradient descent works best when running on a Graphics Processing Unit (GPU). Since computations on this type of hardware can be executed in parallel, using a mini-batch can increase the accuracy of the gradient estimate without increasing computation time. Depending on the memory of the GPU, the mini-batch size is often set between 10 and 100 observations.
A function for mini-batch gradient descent requires the ability to select a batch size. Below is a function that implements this feature.
def minimize_mini_batch(loss_fn, grad_loss_fn, dataset, minibatch_size, alpha=0.2):
    """
    Uses mini-batch gradient descent to minimize loss_fn.
    Returns the minimizing value of theta once theta changes
    less than 0.001 between iterations.
    """
    NUM_OBS = len(dataset)
    assert minibatch_size < NUM_OBS
    theta = 0
    np.random.shuffle(dataset)
    while True:
        for i in range(0, NUM_OBS, minibatch_size):
            mini_batch = dataset[i:i+minibatch_size]
            gradient = grad_loss_fn(theta, mini_batch)
            new_theta = theta - alpha * gradient

            if abs(new_theta - theta) < 0.001:
                return new_theta

            theta = new_theta
        np.random.shuffle(dataset)
We use batch gradient descent to iteratively improve model parameters until the model achieves minimal loss. Since batch gradient descent is computationally intractable with large datasets, we often use stochastic gradient descent to fit models instead. When using a GPU, mini-batch gradient descent can converge more quickly than stochastic gradient descent for the same computational cost. For large datasets, stochastic gradient descent and mini-batch gradient descent are often preferred to batch gradient descent for their faster computation times.
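To see stochastic gradient descent in action, the sketch below applies the same update rule to a constant model with squared error loss on synthetic data. The data, seed, and loss functions here are illustrative choices, not from the tips dataset, and the function body is a condensed copy of minimize_sgd above so the example is self-contained.

```python
import numpy as np

def minimize_sgd(loss_fn, grad_loss_fn, dataset, alpha=0.2):
    """SGD as defined above: one observation per gradient step."""
    theta = 0
    np.random.shuffle(dataset)
    while True:
        for obs in dataset:
            gradient = grad_loss_fn(theta, obs)
            new_theta = theta - alpha * gradient
            if abs(new_theta - theta) < 0.001:
                return new_theta
            theta = new_theta
        np.random.shuffle(dataset)

# Constant model under MSE loss: at a single point y, the loss is
# (y - theta)^2 and its gradient with respect to theta is -2(y - theta).
def mse_loss_single(theta, y):
    return (y - theta) ** 2

def grad_mse_loss_single(theta, y):
    return -2 * (y - theta)

np.random.seed(42)
data = np.random.normal(loc=10, scale=2, size=1000)
theta_hat = minimize_sgd(mse_loss_single, grad_mse_loss_single, data)
```

Because each update uses a single noisy observation, theta_hat bounces around the data's center rather than settling exactly on the mean, which is why SGD trades some precision for its large speedup on big datasets.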
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/12'))
We have introduced a sequence of steps to create a model using a dataset:

1. Select a model.
2. Select a loss function.
3. Fit the model by minimizing the loss on the dataset.
Thus far, we have introduced the constant model (1), a set of loss functions (2), and gradient descent as a general method of minimizing the loss (3). Following these steps will often generate a model that makes accurate predictions on the dataset it was trained on.
Unfortunately, a model that only performs well on its training data has little real-world utility. We care about the model's ability to generalize: our model should make accurate predictions about the population, not just the training data. This problem seems challenging—how might we reason about data we haven't seen yet?
Here we turn to the inferential power of statistics. We first introduce some mathematical tools: random variables, expectation, and variance. Using these tools, we draw conclusions about our model's long-term performance on data from our population, even data that we did not use to train the model!
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
import scipy.stats as stats
Almost all real-world phenomena contain some degree of randomness, making data generation and collection inherently random processes. Since we fit our models on these data, our models also contain randomness. To represent these random processes mathematically, we use random variables.
A random variable is an algebraic variable that represents a numerical value determined by a probabilistic event. In this book, we will always use capital letters (not Greek letters) like $X$ or $Y$ to denote a random variable. Although random variables can represent either discrete (e.g. the number of males in a sample of ten people) or continuous quantities (e.g. the average temperature in Los Angeles), we will only use discrete random variables for the purposes of this textbook.
We must always specify what a given random variable represents. For example, we may write that the random variable $X$ represents the number of heads in 10 coin flips. The definition of a random variable determines the values it can take on. In this example, $X$ may only take on values between $0$ and $10$ inclusive.

We must also be able to determine the probability that the random variable takes on each of its possible values. For example, the probability that $X = 0$ is written as $P(X = 0)$, and we can likewise calculate the probability that $X$ is any value in $\{0, 1, \ldots, 10\}$.
The probability mass function (PMF) or the distribution of a random variable $X$ provides the probability that $X$ takes on each of its possible values. If we let $\mathbb{X}$ be the set of values that $X$ can take on and $x$ be a particular value in $\mathbb{X}$, the PMF of $X$ must satisfy the following rules:

$$\sum_{x \in \mathbb{X}} P(X = x) = 1$$

$$\forall x \in \mathbb{X}, \; 0 \leq P(X = x) \leq 1$$

The first rule states that the probabilities for all possible values of $X$ sum to $1$.

The second rule states that each individual probability for a given value of $x$ must be between $0$ and $1$.
Suppose we let $X$ represent the result of one roll from a fair six-sided die. We know that $X \in \{1, 2, 3, 4, 5, 6\}$ and that $P(X = 1) = \ldots = P(X = 6) = \frac{1}{6}$. We can plot the PMF of $X$ as a probability distribution:
# HIDDEN
def plot_pmf(xs, probs, rv_name='X'):
    plt.plot(xs, probs, 'ro', ms=12, mec='b', color='b')
    plt.vlines(xs, 0, probs, colors='b', lw=4)
    plt.xlabel('$x$')
    plt.ylabel('$P(X = x)$')
    plt.ylim(0, 1)
    plt.title('PMF of $X$');
# HIDDEN
xk = np.arange(1, 7)
pk = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
plot_pmf(np.arange(1, 7), np.repeat(1/6, 6))
plt.yticks(np.linspace(0, 1, 7),
           ('0', r'$\frac{1}{6}$', r'$\frac{2}{6}$', r'$\frac{3}{6}$',
            r'$\frac{4}{6}$', r'$\frac{5}{6}$', '1'));
The notion of PMFs for single random variables extends naturally to joint distributions for multiple random variables. In particular, the joint distribution of two or more random variables yields the probability that these random variables simultaneously take on a specific set of values.
For example, let the random variable $X$ represent the number of heads in 10 coin flips, and let $Y$ represent the number of tails in the same set of 10 coin flips. Since every flip lands either heads or tails, $X + Y = 10$ always; we can note that, for example, $P(X = 4, Y = 6) = P(X = 4)$.

Meanwhile $P(X = 6, Y = 6) = 0$ since we cannot possibly have 6 heads and 6 tails in 10 coin flips.
Sometimes, we start with the joint distribution for two random variables $X$ and $Y$ but want to find the distribution for $X$ alone. This distribution is called the marginal distribution. To find the probability that $X$ takes on a particular value $x$, we must consider all possible values of $Y$ (denoted by $y$) that can simultaneously happen with $X = x$ and sum over all of these joint probabilities:

$$P(X = x) = \sum_{\text{all } y} P(X = x, Y = y)$$

We can prove this identity as follows:

$$\begin{aligned} \sum_{\text{all } y} P(X = x, Y = y) &= \sum_{\text{all } y} P(X = x) P(Y = y \mid X = x) \\ &= P(X = x) \sum_{\text{all } y} P(Y = y \mid X = x) \\ &= P(X = x) \cdot 1 \\ &= P(X = x) \end{aligned}$$

In the last line of this proof, we treated $Y \mid X = x$ as a random variable with some unknown PMF. This is important since we used the property that the probabilities in a PMF sum to $1$, which means that $\sum_y P(Y = y \mid X = x) = 1$.
Like events, two random variables can be dependent or independent. Any two random variables are independent if and only if knowing the outcome of one variable does not alter the probability of observing any outcomes of the other variable.
For example, suppose we flip a coin ten times and let $X$ be the number of heads and $Y$ be the number of tails. Clearly, $X$ and $Y$ are dependent variables since knowing that $X = 10$ means that $Y$ must equal $0$. If we did not observe the value of $X$, $Y$ can take on any value between $0$ and $10$ with non-zero probability.
We might instead conduct two sets of ten flips. If is the number of heads in the first set of flips and is the number of heads in the second set, and are independent since the outcomes of the first set of ten flips do not affect the outcomes of the second set.
Suppose we have a small dataset of four people:
# HIDDEN
data={"Name":["Carol","Bob","John","Dave"], 'Age': [50,52,51,50]}
people = pd.DataFrame(data)
people
Suppose we sample two people from this dataset with replacement. If the random variable $Z$ represents the difference between the ages of the first and second persons in the sample, what is the PMF of $Z$?

To approach this problem, we define two new random variables. We define $X$ as the age of the first person and $Y$ as the age of the second. Then, $Z = X - Y$. Then, we find the joint probability distribution of $X$ and $Y$: the probability of each pair of values that $X$ and $Y$ can take on simultaneously. In this case, note that $X$ and $Y$ are independent and identically distributed; the two random variables represent two independent draws from the same dataset, and the first draw has no influence on the second. For example, the probability that $X = 51$ and $Y = 50$ is $P(X = 51)P(Y = 50) = \frac{1}{4} \cdot \frac{2}{4} = \frac{2}{16}$. In a similar way, we get:
|          | $X = 50$ | $X = 51$ | $X = 52$ |
|----------|----------|----------|----------|
| $Y = 50$ | 4/16     | 2/16     | 2/16     |
| $Y = 51$ | 2/16     | 1/16     | 1/16     |
| $Y = 52$ | 2/16     | 1/16     | 1/16     |
Let us now consider the case in which we sample two people from the same dataset as above but without replacement. As before, we define $X$ as the age of the first person and $Y$ as the age of the second, and $Z = X - Y$. However, now $X$ and $Y$ are not independent; for example, if we know $X = 51$, then $Y \neq 51$. We find the joint distribution of $X$ and $Y$ as follows:
|          | $X = 50$ | $X = 51$ | $X = 52$ |
|----------|----------|----------|----------|
| $Y = 50$ | 2/12     | 2/12     | 2/12     |
| $Y = 51$ | 2/12     | 0        | 1/12     |
| $Y = 52$ | 2/12     | 1/12     | 0        |
We can also find the marginal distribution of $X$ from the table: $P(X = 50) = \frac{2}{12} + \frac{2}{12} + \frac{2}{12} = \frac{2}{4}$, and likewise $P(X = 51) = P(X = 52) = \frac{3}{12} = \frac{1}{4}$.
Notice that we summed each column of the joint distribution table above. One can imagine computing the sum of each column and writing the result in the margin below the table; this is the origin of the term marginal distribution.
You should also notice that $X$ and $Y$ are not independent when we sample without replacement. If $X = 51$, for example, $P(Y = 51 \mid X = 51) = 0 \neq P(Y = 51)$. Nonetheless, $X$ and $Y$ still have the same marginal distribution.
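We can check the without-replacement table and its marginals numerically by enumerating every equally likely ordered pair of draws from the four ages in the dataset:

```python
from itertools import permutations
from collections import Counter

ages = [50, 52, 51, 50]  # Carol, Bob, John, Dave

# Sampling without replacement: the 12 ordered pairs of distinct
# draws (first person, second person) are equally likely
pairs = [(ages[i], ages[j]) for i, j in permutations(range(4), 2)]
joint = Counter(pairs)   # counts out of 12 equally likely outcomes

# P(X = 50, Y = 50) = 2/12: two distinct people are both aged 50
# P(X = 51, Y = 51) = 0: only one person is aged 51
# Marginal of X: P(X = 50) = 6/12 = 2/4, the same as a single draw
marginal_x = Counter(x for x, y in pairs)
```

Enumerating outcomes this way is a handy habit for small probability problems: it turns a table we filled in by hand into a claim a computer can verify.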
In this section, we introduced random variables, mathematical variables that take on values according to a random process. Their outcomes must be defined completely and precisely—each outcome must have a well-defined probability of occurrence. Random variables are useful for representing many random phenomena, including the process of data collection.
Although a random variable is completely described by its probability mass function (PMF), we often use expectation and variance to describe the variable's long-run average and spread. These two values have unique mathematical properties that hold particular importance for data science—for example, we can show that an estimator is accurate in the long term by showing that its expected value is equal to the population parameter. We proceed by defining expectation and variance, introducing their most useful mathematical properties, and concluding with a brief application to estimation.
We are often interested in the long-run average of a random variable because it gives us a sense of the center of the variable's distribution. We call this long-run average the expected value, or the expectation of a random variable. The expected value of a random variable $X$ is:

$$\mathbb{E}[X] = \sum_{x \in \mathbb{X}} x \cdot P(X = x)$$
For example, if $X$ represents the roll of a single fair six-sided die,

$$\mathbb{E}[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \ldots + 6 \cdot \frac{1}{6} = 3.5$$

Notice that the expected value of $X$ does not have to be a possible value of $X$. Although $\mathbb{E}[X] = 3.5$, $X$ cannot actually take on the value $3.5$.
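The die example can be checked directly from the definition of expectation:

```python
# E[X] for one roll of a fair six-sided die, straight from the definition:
# sum over each face value times its probability
faces = range(1, 7)
expected = sum(x * (1 / 6) for x in faces)
# expected is 3.5, which is not itself a possible roll
```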
Example: Recall our dataset from the previous section:
# HIDDEN
data={"Name":["Carol","Bob","John","Dave"], 'Age': [50,52,51,50]}
people = pd.DataFrame(data)
people
We pick one person from this dataset uniformly at random. Let $X$ be a random variable representing the age of this person. Then:

$$\mathbb{E}[X] = \frac{1}{4}(50 + 52 + 51 + 50) = 50.75$$
Example: Suppose we sample two people from the dataset with replacement. If the random variable $Z$ represents the difference between the ages of the first and second persons in the sample, what is $\mathbb{E}[Z]$?

As in the previous section, we define $X$ as the age of the first person and $Y$ as the age of the second such that $Z = X - Y$. From the joint distribution of $X$ and $Y$ given in the previous section, we can find the PMF for $Z$. For example, $P(Z = 1) = P(X = 51, Y = 50) + P(X = 52, Y = 51) = \frac{2}{16} + \frac{1}{16} = \frac{3}{16}$. Thus,

$$\mathbb{E}[Z] = -2 \cdot \frac{2}{16} - 1 \cdot \frac{3}{16} + 0 \cdot \frac{6}{16} + 1 \cdot \frac{3}{16} + 2 \cdot \frac{2}{16} = 0$$

Since $\mathbb{E}[Z] = 0$, we expect that in the long run the difference between the ages of the people in a sample of size 2 will be 0.
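We can confirm this expectation by enumerating all 16 equally likely ordered pairs of draws with replacement:

```python
from itertools import product

ages = [50, 52, 51, 50]  # Carol, Bob, John, Dave

# With replacement, all 16 ordered pairs (first age, second age)
# are equally likely, so E[Z] is the plain average of the differences
diffs = [x - y for x, y in product(ages, repeat=2)]
expected_z = sum(diffs) / len(diffs)
# expected_z is exactly 0: every pair (x, y) is matched by (y, x)
```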
When working with linear combinations of random variables as we did above, we can often make good use of the linearity of expectation instead of tediously calculating each joint probability individually.
The linearity of expectation states that:

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$

From this statement we may also derive:

$$\mathbb{E}[cX] = c\mathbb{E}[X]$$

where $X$ and $Y$ are random variables, and $c$ is a constant.

In words, the expectation of a sum of any two random variables is equal to the sum of the expectations of the variables.

In the previous example, we saw that $Z = X - Y$. Thus, $\mathbb{E}[Z] = \mathbb{E}[X - Y] = \mathbb{E}[X] - \mathbb{E}[Y]$.

Now we can calculate $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ separately from each other. Since $\mathbb{E}[X] = \mathbb{E}[Y] = 50.75$, $\mathbb{E}[Z] = 50.75 - 50.75 = 0$.
The linearity of expectation holds even if $X$ and $Y$ are dependent on each other! As an example, let us again consider the case in which we sample two people from our small dataset in the previous section without replacement. As before, we define $X$ as the age of the first person and $Y$ as the age of the second, and $Z = X - Y$. Clearly, $X$ and $Y$ are not independent—knowing $X = 51$, for example, means that $Y \neq 51$.
From the joint distribution of $X$ and $Y$ given in the previous section, we can find $\mathbb{E}[Z]$:

$$\mathbb{E}[Z] = -2 \cdot \frac{2}{12} - 1 \cdot \frac{3}{12} + 0 \cdot \frac{2}{12} + 1 \cdot \frac{3}{12} + 2 \cdot \frac{2}{12} = 0$$

A simpler way to compute this expectation is to use the linearity of expectation. Even though $X$ and $Y$ are dependent, $\mathbb{E}[Z] = \mathbb{E}[X] - \mathbb{E}[Y]$. Recall from the previous section that $X$ and $Y$ have the same PMF even though we are sampling without replacement, which means that $\mathbb{E}[X] = \mathbb{E}[Y] = 50.75$. Hence as in the first scenario, $\mathbb{E}[Z] = 0$.

Note that the linearity of expectation only holds for linear combinations of random variables. For example, $XY$ is not a linear combination of $X$ and $Y$. In this case, $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ is true in general only for independent random variables.
The variance of a random variable is a numerical description of the variable's spread. For a random variable $X$:

$$Var(X) = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]$$

The above formula states that the variance of $X$ is the average squared distance from $X$'s expected value.

With some algebraic manipulation that we omit for brevity, we may also equivalently write:

$$Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$
Consider the two random variables $X$ and $Y$ with the following probability distributions:
# HIDDEN
def plot_pmf(xs, probs, rv_name='X', val_name='x', prob_denom=4):
    plt.plot(xs, probs, 'ro', ms=12, mec='b', color='b')
    plt.vlines(xs, 0, probs, colors='b', lw=4)
    plt.xlabel(f'${val_name}$')
    plt.ylabel(f'$P({rv_name} = {val_name})$')
    plt.ylim(0, 1)
    plt.yticks(np.linspace(0, 1, prob_denom + 1),
               ['0']
               + [rf'$\frac{{{n}}}{{{prob_denom}}}$'
                  for n in range(1, prob_denom)]
               + ['1'])
    plt.title(f'PMF of ${rv_name}$');
# HIDDEN
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plot_pmf([-1, 1], [0.5, 0.5])
plt.xlim(-2, 2);
plt.subplot(1, 2, 2)
plot_pmf((-2, -1, 1, 2), (1/4, 1/4, 1/4, 1/4), rv_name='Y', val_name='y')
plt.tight_layout()
$X$ takes on values -1 and 1 with probability $\frac{1}{2}$ each. $Y$ takes on values -2, -1, 1, and 2 with probability $\frac{1}{4}$ each. We find that $\mathbb{E}[X] = \mathbb{E}[Y] = 0$. Since $Y$'s distribution has a higher spread than $X$'s, we expect that $Var(Y)$ is larger than $Var(X)$.

$$Var(X) = 1 \qquad Var(Y) = \frac{(-2)^2 + (-1)^2 + 1^2 + 2^2}{4} = 2.5$$

As expected, the variance of $Y$ is greater than the variance of $X$.
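These two variances can be computed directly from the definition $Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$, a small numerical check of the claim above:

```python
import numpy as np

# X takes on -1 and 1 with probability 1/2 each;
# Y takes on -2, -1, 1, and 2 with probability 1/4 each
x_vals, x_probs = np.array([-1, 1]), np.array([1/2, 1/2])
y_vals, y_probs = np.array([-2, -1, 1, 2]), np.array([1/4] * 4)

def variance(vals, probs):
    """Var = E[(X - E[X])^2], computed from a finite PMF."""
    mean = np.sum(vals * probs)
    return np.sum((vals - mean) ** 2 * probs)

var_x = variance(x_vals, x_probs)  # 1.0
var_y = variance(y_vals, y_probs)  # 2.5
```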
The variance has a useful property to simplify some calculations. If $X$ is a random variable and $a$ and $b$ are constants:

$$Var(aX + b) = a^2 Var(X)$$

If two random variables $X$ and $Y$ are independent:

$$Var(X + Y) = Var(X) + Var(Y)$$

Note that the linearity of expectation holds for any $X$ and $Y$ even if they are dependent. However, $Var(X + Y) = Var(X) + Var(Y)$ holds only when $X$ and $Y$ are independent.
The covariance of two random variables $X$ and $Y$ is defined as:

$$Cov(X, Y) = \mathbb{E}\left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right]$$

Again, we can perform some algebraic manipulation to obtain:

$$Cov(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

Note that although the variance of a single random variable must be non-negative, the covariance of two random variables can be negative. In fact, the covariance helps measure the correlation between two random variables; the sign of the covariance helps us determine whether two random variables are positively or negatively correlated. If two random variables $X$ and $Y$ are independent, then $Cov(X, Y) = 0$, and $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$.
Suppose we want to use a random variable $X$ to simulate a biased coin with $P(\text{Heads}) = p$. We can say that $X = 1$ if the coin flip is heads, and $X = 0$ if the coin flip is tails. Therefore, $P(X = 1) = p$, and $P(X = 0) = 1 - p$. This type of binary random variable is called a Bernoulli random variable; we can calculate its expected value and variance as follows:

$$\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

$$Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = 1^2 \cdot p + 0^2 \cdot (1 - p) - p^2 = p(1 - p)$$
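The Bernoulli formulas above reduce to simple arithmetic, so they are easy to verify for any particular bias (the value of $p$ below is an arbitrary illustrative choice):

```python
p = 0.3  # an arbitrary illustrative coin bias

# E[X] = 1 * p + 0 * (1 - p) = p
expectation = 1 * p + 0 * (1 - p)

# Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)
second_moment = 1 ** 2 * p + 0 ** 2 * (1 - p)
variance = second_moment - expectation ** 2
```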
Suppose we possess a biased coin with $P(\text{Heads}) = p$ and we would like to estimate $p$. We can flip the coin $n$ times to collect a sample of flips and calculate the proportion of heads in our sample, $\hat{p}$. If we know that $\hat{p}$ is often close to $p$, we can use $\hat{p}$ as an estimator for $p$.

Notice that $p$ is not a random quantity; it is a fixed value based on the bias of the coin. $\hat{p}$, however, is a random quantity since it is generated from the random outcomes of flipping the coin. Thus, we can compute the expectation and variance of $\hat{p}$ to precisely understand how well it estimates $p$.

To compute $\mathbb{E}[\hat{p}]$, we will first define random variables for each flip in the sample. Let $X_i$ be a Bernoulli random variable for the $i$th coin flip. Then, we know that:

$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

To calculate the expectation of $\hat{p}$, we can plug in the formula above and use the fact that $\mathbb{E}[X_i] = p$ since $X_i$ is a Bernoulli random variable:

$$\mathbb{E}[\hat{p}] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot np = p$$

We find that $\mathbb{E}[\hat{p}] = p$. In other words, with enough flips we expect our estimator to converge to the true coin bias $p$. We say that $\hat{p}$ is an unbiased estimator of $p$.

Next, we calculate the variance of $\hat{p}$. Since each flip is independent of the others, we know that $X_1, \ldots, X_n$ are independent. This allows us to use the linearity of variance:

$$Var(\hat{p}) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} \cdot n p(1 - p) = \frac{p(1 - p)}{n}$$

From the equivalence above, we see that the variance of our estimator decreases as we increase $n$, the number of flips in our sample. In other words, if we collect lots of data we can be more certain about our estimator's value. This behavior is known as the law of large numbers.
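A quick simulation illustrates this shrinking variance. The bias, seed, and sample sizes below are illustrative choices; for each $n$ we simulate many samples of $n$ flips and measure the empirical variance of $\hat{p}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7  # an illustrative coin bias; any value in (0, 1) behaves the same way

variances = []
for n in (10, 100, 1000):
    flips = rng.random((5000, n)) < p    # 5000 independent samples of n flips
    p_hat = flips.mean(axis=1)           # proportion of heads in each sample
    variances.append(p_hat.var())

# The empirical variance of p_hat shrinks roughly like p(1 - p) / n
```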
We use expectation and variance to provide simple descriptions of a random variable's center and spread. These mathematical tools allow us to determine how well a quantity calculated from a sample estimates a quantity in the population.
Minimizing a loss function creates a model that is accurate on its training data. Expectation and variance allow us to make general statements about the model's accuracy on unseen data from the population.
In a modeling scenario presented in a previous chapter, a waiter collected a dataset of tips for a particular month of work. We selected a constant model and minimized the mean squared error (MSE) loss function on this dataset, guaranteeing that our constant model outperforms all other constant models on this dataset and loss function. The constant model has a single parameter, $\theta$. We found that the optimizing parameter for the MSE loss is the mean of the tip percentages in the dataset.
Although such a model makes relatively accurate predictions on its training data, we would like to know whether the model will perform well on new data from the population. To represent this notion, we introduce statistical risk, also known as the expected loss.
A model's risk is the expected value of the model's loss on randomly chosen points from the population.
In this scenario, the population consists of all tip percentages our waiter receives during his employment, including future tips. We use the random variable $X$ to represent a randomly chosen tip percent from the population, and the usual variable $\theta$ to represent the constant model's prediction. Using this notation, the risk of our model is:

$$R(\theta) = \mathbb{E}\left[ (X - \theta)^2 \right]$$

In the expression above, we use the MSE loss which gives the inner $(X - \theta)^2$ in the expectation. The risk is a function of $\theta$ since we can change $\theta$ as we please.
Unlike loss alone, using risk allows us to reason about the model's accuracy on the population in general. If our model achieves a low risk, our model will make accurate predictions on points from the population in the long term. On the other hand, if our model has a high risk it will in general perform poorly on data from the population.
Naturally, we would like to choose the value of $\theta$ that makes the model's risk as low as possible. We use the variable $\theta^*$ to represent the risk-minimizing value of $\theta$, or the optimal model parameter for the population. To clarify, $\theta^*$ represents the model parameter that minimizes risk while $\hat{\theta}$ represents the parameter that minimizes dataset-specific loss.
Let's find the value of $\theta$ that minimizes the risk. Previously, we used calculus to perform this minimization. This time, we will use a mathematical trick that produces a meaningful final expression. We replace $X - \theta$ with $X - \mathbb{E}[X] + \mathbb{E}[X] - \theta$ and expand:

$$R(\theta) = \mathbb{E}\left[ (X - \mathbb{E}[X] + \mathbb{E}[X] - \theta)^2 \right]$$

Now, we apply the linearity of expectation and simplify. We use the identity $\mathbb{E}\left[ X - \mathbb{E}[X] \right] = 0$, which is roughly equivalent to stating that $\mathbb{E}[X]$ lies at the center of the distribution of $X$.

$$ \begin{aligned} R(\theta) &= \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right] + 2(\mathbb{E}[X] - \theta)\mathbb{E}\left[ X - \mathbb{E}[X] \right] + (\mathbb{E}[X] - \theta)^2 \\ &= \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right] + (\mathbb{E}[X] - \theta)^2 \end{aligned} $$

Notice that the first term in the expression above is the variance of $X$, $Var(X)$, which has no dependence on $\theta$. The second term gives a measure of how close $\theta$ is to $\mathbb{E}[X]$. Because of this, the second term is called the bias of our model. In other words, the model's risk is the bias of the model plus the variance of the quantity we are trying to predict:

$$ \begin{aligned} R(\theta) &= \underbrace{(\mathbb{E}[X] - \theta)^2}_\text{bias} + \underbrace{Var(X)}_\text{variance} \end{aligned} $$

Thus, the risk is minimized when our model has no bias: $\theta^* = \mathbb{E}[X]$.
Notice that when our model has no bias, the risk is usually a positive quantity. This implies that even an optimal model will have prediction error. Intuitively, this occurs because a constant model will only predict a single number while $X$ may take on any value from the population. The variance term captures the magnitude of the error. A low variance means that $X$ will likely take a value close to $\mathbb{E}[X]$, whereas a high variance means that $X$ is more likely to take on a value far from $\mathbb{E}[X]$.
From the above analysis, we would like to set $\theta = \mathbb{E}[X]$. Unfortunately, calculating $\mathbb{E}[X]$ requires complete knowledge of the population. To understand why, examine the expression for $\mathbb{E}[X]$:

$$\mathbb{E}[X] = \sum_{x \in \mathbb{X}} x \cdot P(X = x)$$

$P(X = x)$ represents the probability that $X$ takes on a specific value from the population. To calculate this probability, however, we need to know all possible values of $X$ and how often they appear in the population. In other words, to perfectly minimize a model's risk on a population, we need access to the population.
We can tackle this issue by remembering that the distribution of values in a large random sample will be close to the distribution of values in the population. If this is true about our sample, we can treat the sample as though it were the population itself.
Suppose we draw points at random from the sample instead of the population. Since there are $n$ total points in the sample $x_1, x_2, \ldots, x_n$, each point has probability $\frac{1}{n}$ of appearing. Now we can create an approximation for $\mathbb{E}[X]$:

$$\mathbb{E}[X] \approx \frac{1}{n} \sum_{i=1}^{n} x_i = \text{mean}(\mathbf{x})$$

Thus, our best estimate of $\theta^*$ using the information captured in a random sample is $\hat{\theta} = \text{mean}(\mathbf{x})$. We say that $\hat{\theta}$ minimizes the empirical risk, the risk calculated using the sample as a stand-in for the population.

It is essential to note the importance of random sampling in the approximation above. If our sample is non-random, we cannot make the above assumption that the sample's distribution is similar to the population's. Using a non-random sample to estimate $\theta^*$ will often result in a biased estimation and a higher risk.

Recall that we have previously shown that $\hat{\theta} = \text{mean}(\mathbf{x})$ minimizes the MSE loss on a dataset. Now, we have taken a meaningful step forward. If our training data are a random sample, $\hat{\theta}$ not only produces the best model for its training data but also produces the best model for the population given the information we have in our sample.
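We can verify numerically that the sample mean minimizes the empirical risk for a constant model; the sample values below are made up for illustration:

```python
import numpy as np

sample = np.array([12.0, 15.5, 13.0, 20.0, 16.5])  # hypothetical tip percents

def empirical_risk(theta, sample):
    """Mean squared error of the constant prediction theta on the sample."""
    return np.mean((sample - theta) ** 2)

theta_hat = sample.mean()

# Sweep over candidate constants: none should beat the sample mean
candidates = np.linspace(sample.min(), sample.max(), 101)
risks = [empirical_risk(t, sample) for t in candidates]
```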
Using the mathematical tools developed in this chapter, we have developed an understanding of our model's performance on the population. A model makes accurate predictions if it minimizes statistical risk. We found that the globally optimal model parameter is:

$$\theta^* = \mathbb{E}[X]$$

Since we cannot readily compute this, we found the model parameter $\hat{\theta} = \text{mean}(\mathbf{x})$ that minimizes the empirical risk.

If the training data are randomly sampled from the population, it is likely that $\hat{\theta} \approx \theta^*$. Thus, a constant model trained on a large random sample from the population will likely perform well on the population as well.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/13'))
Now that we have a general method for fitting a model by minimizing a cost function, we turn our attention to improvements on our model. For the sake of simplicity, we previously restricted ourselves to a constant model: our model only ever predicts a single number.
However, giving our waiter such a model would hardly satisfy him. He would likely point out that he collected much more information about his tables than simply the tip percentages. Why didn't we use his other data—e.g. the size of the table or the total bill—in order to make our model more useful?
In this chapter we will introduce linear models which will allow us to make use of our entire dataset to make predictions. Linear models are not only widely used in practice but also have rich theoretical underpinnings that will allow us to understand future tools for modeling. We introduce a simple linear regression model that uses one explanatory variable, explain how gradient descent is used to fit the model, and finally extend the model to incorporate multiple explanatory variables.
Previously, we worked with a dataset that contained one row for each table that a waiter served in a week. Our waiter collected this data in order to predict the tip amount he could expect to receive from a future table.
tips = sns.load_dataset('tips')
tips.head()
sns.distplot(tips['tip'], bins=25);
As we have covered previously, if we choose a constant model and the mean squared error cost, our model will predict the mean of the tip amount:
np.mean(tips['tip'])
This means that if a new party orders a meal and the waiter asks us how much tip he will likely receive, we will say "around $3", no matter how large the table is or how much their total bill was.
However, looking at other variables in the dataset, we see that we might be able to make more accurate predictions if we incorporate them into our model. For example, the following plot of the tip amount against the total bill shows a positive association.
# HIDDEN
sns.lmplot(x='total_bill', y='tip', data=tips, fit_reg=False)
plt.title('Tip amount vs. Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount');
Although the average tip amount is about $\$3$, if a table orders a large amount of food we would certainly expect that the waiter receives more than $\$3$ of tip. Thus, we would like to alter our model so that it makes predictions based on the variables in our dataset instead of blindly predicting the mean tip amount. To do this, we use a linear model instead of a constant one.
Let's briefly review our current toolbox for modeling and estimation and define some new notation so that we can better represent the additional complexity that linear models have.
We are interested in predicting the tip amount based on the total bill of a table. Let $y$ represent the tip amount, the variable we are trying to predict. Let $x$ represent the total bill, the variable we are incorporating for prediction.

We define a linear model $f_{\theta^*}$ that depends on $x$:

$$f_{\theta^*}(x) = \theta_1^* x + \theta_0^*$$

We treat $f_{\theta^*}(x)$ as the underlying function that generated the data.

$f_{\theta^*}(x)$ assumes that in truth, $y$ has a perfectly linear relationship with $x$. However, our observed data do not follow a perfectly straight line because of some random noise $\epsilon$. Mathematically, we account for this by adding a noise term:

$$y = f_{\theta^*}(x) + \epsilon$$

If the assumption that $y$ has a perfectly linear relationship with $x$ holds, and we are able to somehow find the exact values of $\theta_1^*$ and $\theta_0^*$, and we magically have no random noise, we will be able to perfectly predict the amount of tip the waiter will get for all tables, forever. Of course, we cannot completely fulfill any of these criteria in practice. Instead, we will estimate $\theta_1^*$ and $\theta_0^*$ using our dataset to make our predictions as accurate as possible.

Since we cannot find $\theta_1^*$ and $\theta_0^*$ exactly, we will assume that our dataset approximates our population and use our dataset to estimate these parameters. We denote our estimations with $\theta_1$ and $\theta_0$, our fitted estimations with $\hat{\theta}_1$ and $\hat{\theta}_0$, and our model as:

$$f_{\hat{\theta}}(x) = \hat{\theta}_1 x + \hat{\theta}_0$$

Sometimes you will see $h(x)$ written instead of $f_{\hat{\theta}}(x)$; the "$h$" stands for hypothesis, as $f_{\hat{\theta}}(x)$ is our hypothesis of $f_{\theta^*}(x)$.

In order to determine $\hat{\theta}_1$ and $\hat{\theta}_0$, we choose a cost function and minimize it using gradient descent.
# HIDDEN
tips = sns.load_dataset('tips')
# HIDDEN
def minimize(loss_fn, grad_loss_fn, x_vals, y_vals,
             alpha=0.0005, progress=True):
    '''
    Uses gradient descent to minimize loss_fn. Returns the minimizing value of
    theta once the loss changes less than 0.0001 between iterations.
    '''
    theta = np.array([0., 0.])
    loss = loss_fn(theta, x_vals, y_vals)
    while True:
        if progress:
            print(f'theta: {theta} | loss: {loss}')
        gradient = grad_loss_fn(theta, x_vals, y_vals)
        new_theta = theta - alpha * gradient
        new_loss = loss_fn(new_theta, x_vals, y_vals)
        if abs(new_loss - loss) < 0.0001:
            return new_theta
        theta = new_theta
        loss = new_loss
We want to fit a linear model that predicts the tip amount based on the total bill of the table:

$$f_{\hat{\theta}}(x) = \hat{\theta}_1 x + \hat{\theta}_0$$

In order to optimize $\hat{\theta}_1$ and $\hat{\theta}_0$, we need to first choose a loss function. We will choose the mean squared error loss function:

$$L(\hat{\theta}, \mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f_{\hat{\theta}}(x_i) \right)^2$$

Note that we have modified our loss function to reflect the addition of an explanatory variable in our new model. Now, $\mathbf{x}$ is a vector containing the $n$ individual total bills, $\mathbf{y}$ is a vector containing the $n$ individual tip amounts, and $\hat{\theta}$ is a vector: $\hat{\theta} = [\hat{\theta}_0, \hat{\theta}_1]$.

Using a linear model with the squared error loss also goes by the name of least-squares linear regression. We can use gradient descent to find the $\hat{\theta}$ that minimizes the loss.
An Aside on Using Correlation
If you have seen least-squares linear regression before, you may recognize that we can compute the correlation coefficient and use it to determine and . This is simpler and faster to compute than using gradient descent for this specific problem, similar to how computing the mean was simpler than using gradient descent to fit a constant model. We will use gradient descent anyway because it is a general-purpose method of loss minimization that still works when we later introduce models that do not have analytic solutions. In fact, in many real-world scenarios, we will use gradient descent even when an analytic solution exists because computing the analytic solution can take longer than gradient descent, especially on large datasets.
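To make this aside concrete, here is a sketch of the correlation-based closed form on hypothetical data (not the tips dataset): the slope is $ r \cdot \sigma_y / \sigma_x $ and the intercept is $ \bar{y} - \text{slope} \cdot \bar{x} $.

```python
import numpy as np

def correlation_fit(x, y):
    '''Closed-form simple linear regression via the correlation coefficient.'''
    r = np.corrcoef(x, y)[0, 1]                  # correlation coefficient
    slope = r * np.std(y) / np.std(x)            # theta_1
    intercept = np.mean(y) - slope * np.mean(x)  # theta_0
    return intercept, slope

# Hypothetical data for illustration:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
theta_0, theta_1 = correlation_fit(x, y)
```

This closed form agrees with the least-squares fit produced by, for example, `np.polyfit(x, y, 1)`.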
In order to use gradient descent, we have to compute the gradient of the MSE loss with respect to $ \theta $. Now that $ \theta $ is a vector of length 2 instead of a scalar, $ \nabla_{\theta} L(\theta, \textbf{x}, \textbf{y}) $ will also be a vector of length 2.

We know:

$$ \nabla_{\theta} L(\theta, \textbf{x}, \textbf{y}) = -\frac{2}{n} \sum_{i} (y_i - f_\theta (x_i)) \nabla_{\theta} f_\theta (x_i) $$

We now need to compute $ \nabla_{\theta} f_\theta (x_i) $, which is a length 2 vector.

$$ \nabla_{\theta} f_\theta (x_i) = \begin{bmatrix} \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_1 x_i) \\ \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_1 x_i) \end{bmatrix} = \begin{bmatrix} 1 \\ x_i \end{bmatrix} $$

Finally, we plug back into our formula above to get

$$ \nabla_{\theta} L(\theta, \textbf{x}, \textbf{y}) = -\frac{2}{n} \sum_{i} (y_i - f_\theta (x_i)) \begin{bmatrix} 1 \\ x_i \end{bmatrix} $$

This is a length 2 vector since $ (y_i - f_\theta (x_i)) $ is a scalar.
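A useful habit when deriving gradients by hand is to check the result against a numerical finite-difference approximation. The sketch below is self-contained: it defines its own copies of the loss and gradient on made-up data rather than reusing the functions defined later.

```python
import numpy as np

def loss(theta, x, y):
    # MSE loss for the simple linear model theta_0 + theta_1 * x
    return np.mean((y - (theta[0] + theta[1] * x)) ** 2)

def grad(theta, x, y):
    # Analytic gradient derived above: -2/n * sum((y - f(x)) * [1, x_i])
    resid = y - (theta[0] + theta[1] * x)
    return -2 / len(x) * np.array([np.sum(resid), np.sum(resid * x)])

def numerical_grad(theta, x, y, eps=1e-6):
    # Central finite differences, one coordinate of theta at a time
    out = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        out[j] = (loss(theta + step, x, y) - loss(theta - step, x, y)) / (2 * eps)
    return out

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.0])
theta = np.array([0.5, 0.5])
```

If the derivation is correct, `grad` and `numerical_grad` agree up to floating-point error.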
Now, let's fit a linear model on the tips dataset to predict the tip amount from the total table bill.
First, we define a Python function to compute the loss:
def simple_linear_model(thetas, x_vals):
    '''Returns predictions by a linear model on x_vals.'''
    return thetas[0] + thetas[1] * x_vals

def mse_loss(thetas, x_vals, y_vals):
    return np.mean((y_vals - simple_linear_model(thetas, x_vals)) ** 2)
Then, we define a function to compute the gradient of the loss:
def grad_mse_loss(thetas, x_vals, y_vals):
    n = len(x_vals)
    grad_0 = y_vals - simple_linear_model(thetas, x_vals)
    grad_1 = (y_vals - simple_linear_model(thetas, x_vals)) * x_vals
    return -2 / n * np.array([np.sum(grad_0), np.sum(grad_1)])
# HIDDEN
thetas = np.array([1, 1])
x_vals = np.array([3, 4])
y_vals = np.array([4, 5])
assert np.allclose(grad_mse_loss(thetas, x_vals, y_vals), [0, 0])
We'll use the previously defined minimize function that runs gradient descent, accounting for our new explanatory variable. It has the function signature (body omitted):
minimize(loss_fn, grad_loss_fn, x_vals, y_vals)
Finally, we run gradient descent!
%%time
thetas = minimize(mse_loss, grad_mse_loss, tips['total_bill'], tips['tip'])
We can see that gradient descent converges to the values of $ \hat{\theta_0} $ and $ \hat{\theta_1} $ printed above. Our linear model is $ y = \hat{\theta_0} + \hat{\theta_1} x $.
We can use our estimated thetas to make and plot our predictions alongside the original data points.
# HIDDEN
x_vals = np.array([0, 55])
sns.lmplot(x='total_bill', y='tip', data=tips, fit_reg=False)
plt.plot(x_vals, simple_linear_model(thetas, x_vals), c='goldenrod')
plt.title('Tip amount vs. Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount');
We can see that the fitted line passes through the bulk of the data: to predict the tip for a table, our model reads off the height of the line at that table's total bill, so larger bills lead to larger predicted tips.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/13'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + nrows, col:col + ncols]

    if len(df.columns) <= ncols:
        interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
    else:
        interact(peek,
                 row=(0, len(df) - nrows, nrows),
                 col=(0, len(df.columns) - ncols))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
from scipy.optimize import minimize as sci_min
def minimize(loss_fn, grad_loss_fn, X, y, progress=True):
    '''
    Uses scipy.minimize to minimize loss_fn using a form of gradient descent.
    '''
    theta = np.zeros(X.shape[1])
    iters = 0

    def objective(theta):
        return loss_fn(theta, X, y)

    def gradient(theta):
        return grad_loss_fn(theta, X, y)

    def print_theta(theta):
        nonlocal iters
        if progress and iters % progress == 0:
            print(f'theta: {theta} | loss: {loss_fn(theta, X, y):.2f}')
        iters += 1

    print_theta(theta)
    return sci_min(
        objective, theta, method='BFGS', jac=gradient, callback=print_theta,
        tol=1e-7
    ).x
Our simple linear model has a key advantage over the constant model: it uses the data when making predictions. However, it is still rather limited since it only uses one variable in our dataset. Many datasets have many potentially useful variables, and multiple linear regression can take advantage of that. For example, consider the following dataset on car models and their mileage in miles per gallon (MPG):
mpg = pd.read_csv('mpg.csv').dropna().reset_index(drop=True)
mpg
It seems likely that multiple attributes of a car model will affect its MPG. For example, the MPG seems to decrease as horsepower increases:
# HIDDEN
sns.lmplot(x='horsepower', y='mpg', data=mpg);
However, cars released later generally have better MPG than older cars:
sns.lmplot(x='model year', y='mpg', data=mpg);
It seems possible that we can get a more accurate model if we could take both horsepower and model year into account when making predictions about the MPG. In fact, perhaps the best model takes into account all the numerical variables in our dataset. We can extend our univariate linear regression to allow prediction based on any number of attributes.
We state the following model:

$$ f_\theta (\textbf{x}) = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p $$

where $ \textbf{x} $ now represents a vector containing $ p $ attributes of a single car. The model above says, "Take multiple attributes of a car, multiply them by some weights, and add them together to make a prediction for MPG."

For example, if we're making a prediction on the first car in our dataset using the horsepower, weight, and model year columns, the vector $ \textbf{x} $ looks like:
# HIDDEN
mpg.loc[0:0, ['horsepower', 'weight', 'model year']]
In these examples, we've kept the column names for clarity, but keep in mind that $ \textbf{x} $ only contains the numerical values of the table above: $ \textbf{x} = [130.0, 3504.0, 70.0] $.

Now, we will perform a notational trick that will greatly simplify later formulas. We will prepend the value $ 1 $ to the vector $ \textbf{x} $, so that we have the following vector for $ \textbf{x} $:
# HIDDEN
mpg_mat = mpg.assign(bias=1)
mpg_mat.loc[0:0, ['bias', 'horsepower', 'weight', 'model year']]
Now, observe what happens to the formula for our model:

$$ f_\theta (\textbf{x}) = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_p x_p = \theta \cdot \textbf{x} $$

where $ \theta \cdot \textbf{x} $ is the vector dot product of $ \theta $ and $ \textbf{x} $ (and $ x_0 = 1 $ is the prepended bias). Vector and matrix notation were designed to succinctly write linear combinations and are thus well-suited for our linear models. However, you will have to remember from now on that $ f_\theta (\textbf{x}) = \theta \cdot \textbf{x} $ is a vector-vector dot product. When in doubt, you can always expand the dot product into simple multiplications and additions.
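For instance, expanding the dot product with made-up numbers (a bias of $ 1 $ followed by two attribute values):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])   # hypothetical weights
x = np.array([1.0, 130.0, 3504.0])  # bias term followed by two attributes

# The dot product...
via_dot = theta @ x

# ...is exactly this sum of element-wise products:
via_sum = sum(theta[j] * x[j] for j in range(len(theta)))
```

Both expressions compute $ 1 \cdot 1 + 2 \cdot 130 + 3 \cdot 3504 $.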
Now, we define the matrix $ \textbf{X} $ as the matrix containing every car model as a row and a first column of biases. For example, here are the first five rows of $ \textbf{X} $:
# HIDDEN
mpg_mat = mpg.assign(bias=1)
mpg_mat.loc[0:4, ['bias', 'horsepower', 'weight', 'model year']]
Again, keep in mind that the actual matrix $ \textbf{X} $ only contains the numerical values of the table above.

Notice that $ \textbf{X} $ is composed of multiple $ \textbf{x} $ vectors stacked on top of each other. To keep the notation clear, we define $ X_i $ to refer to the row vector with index $ i $ of $ \textbf{X} $. We define $ X_{i,j} $ to refer to the element with index $ j $ of the row with index $ i $ of $ \textbf{X} $. Thus, $ X_i $ is a $ p $-dimensional vector and $ X_{i,j} $ is a scalar. $ \textbf{X} $ is an $ n \times p $ matrix, where $ n $ is the number of car examples we have and $ p $ is the number of attributes we have for a single car.

For example, from the table above we have $ X_0 = [1, 130.0, 3504.0, 70.0] $ and $ X_{0,1} = 130.0 $. This notation becomes important when we define the loss function since we will need both $ \textbf{X} $, the matrix of input values, and $ \textbf{y} $, the vector of MPGs.
The mean squared error loss function takes in a vector of weights $ \theta $, a matrix of inputs $ \textbf{X} $, and a vector of observed MPGs $ \textbf{y} $:

$$ L(\theta, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_{i} (y_i - f_\theta (X_i))^2 $$

We've previously derived the gradient of the MSE loss with respect to $ \theta $:

$$ \nabla_{\theta} L(\theta, \textbf{X}, \textbf{y}) = -\frac{2}{n} \sum_{i} (y_i - f_\theta (X_i)) \nabla_{\theta} f_\theta (X_i) $$

We know that:

$$ f_\theta (\textbf{x}) = \theta \cdot \textbf{x} = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_p x_p $$

Let's now compute $ \nabla_{\theta} f_\theta (\textbf{x}) $. The result is surprisingly simple because $ \frac{\partial}{\partial \theta_0} f_\theta (\textbf{x}) = x_0 $ and thus $ \frac{\partial}{\partial \theta_1} f_\theta (\textbf{x}) = x_1 $, $ \frac{\partial}{\partial \theta_2} f_\theta (\textbf{x}) = x_2 $, and so on.

$$ \nabla_{\theta} f_\theta (\textbf{x}) = \textbf{x} $$

Finally, we plug this result back into our gradient calculations:

$$ \nabla_{\theta} L(\theta, \textbf{X}, \textbf{y}) = -\frac{2}{n} \sum_{i} (y_i - f_\theta (X_i)) X_i $$

Remember that since $ (y_i - f_\theta (X_i)) $ is a scalar and $ X_i $ is a $ p $-dimensional vector, the gradient $ \nabla_{\theta} L(\theta, \textbf{X}, \textbf{y}) $ is a $ p $-dimensional vector.

We saw this same type of result when we computed the gradient for univariate linear regression and found that it was 2-dimensional since $ \theta $ was 2-dimensional.
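The summation form of the gradient can also be written as a single matrix expression, $ -\frac{2}{n} \textbf{X}^T (\textbf{y} - \textbf{X} \theta) $, which is the form used in the code below. As a sketch on made-up data, we can check that the explicit per-example sum and the matrix form agree:

```python
import numpy as np

def grad_sum_form(theta, X, y):
    # -2/n * sum_i (y_i - X_i . theta) * X_i, written as an explicit loop
    n = len(X)
    total = np.zeros_like(theta)
    for i in range(n):
        total += (y[i] - X[i] @ theta) * X[i]
    return -2 / n * total

def grad_matrix_form(theta, X, y):
    # The same quantity written with one matrix product
    n = len(X)
    return -2 / n * X.T @ (y - X @ theta)

X = np.array([[1.0, 2.0], [1.0, 0.0], [1.0, -1.0]])
y = np.array([3.0, 1.0, -2.0])
theta = np.array([0.5, 1.5])
```

The matrix form is what we implement in practice since it avoids a Python-level loop.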
We can now plug in our loss and its derivative into gradient descent. As usual, we will define the model, loss, and gradient loss in Python.
def linear_model(thetas, X):
    '''Returns predictions by a linear model on X.'''
    return X @ thetas

def mse_loss(thetas, X, y):
    return np.mean((y - linear_model(thetas, X)) ** 2)

def grad_mse_loss(thetas, X, y):
    n = len(X)
    return -2 / n * (X.T @ y - X.T @ X @ thetas)
# HIDDEN
thetas = np.array([1, 1, 1, 1])
X = np.array([[2, 1, 0, 1], [1, 2, 3, 4]])
y = np.array([3, 9])
assert np.allclose(linear_model(thetas, X), [4, 10])
assert np.allclose(mse_loss(thetas, X, y), 1.0)
assert np.allclose(grad_mse_loss(thetas, X, y), [ 3., 3., 3., 5.])
assert np.allclose(grad_mse_loss(thetas, X + 1, y), [ 25., 25., 25., 35.])
Now, we can simply plug in our functions into our gradient descent minimizer:
# HIDDEN
X = (mpg_mat
     .loc[:, ['bias', 'horsepower', 'weight', 'model year']]
     .to_numpy())
y = mpg_mat['mpg'].to_numpy()
%%time
thetas = minimize(mse_loss, grad_mse_loss, X, y)
print(f'theta: {thetas} | loss: {mse_loss(thetas, X, y):.2f}')
According to gradient descent, our linear model is $ y = \hat{\theta} \cdot \textbf{x} $, using the values of $ \hat{\theta} $ printed above.
How does our model do? We can see that the loss decreased dramatically (from 610 to 11.6). We can show the predictions of our model alongside the original values:
# HIDDEN
reordered = ['predicted_mpg', 'mpg', 'horsepower', 'weight', 'model year']
with_predictions = (
mpg
.assign(predicted_mpg=linear_model(thetas, X))
.loc[:, reordered]
)
with_predictions
Since we found $ \hat{\theta} $ from gradient descent, we can verify that computing $ \hat{\theta} \cdot X_0 $ for the first row of our data matches our prediction above:
print(f'Prediction for first row: '
f'{thetas[0] + thetas[1] * 130 + thetas[2] * 3504 + thetas[3] * 70:.2f}')
We've included a widget below to pan through the predictions and the data used to make the prediction:
# HIDDEN
df_interact(with_predictions)
We can also plot the residuals of our predictions (actual values - predicted values):
resid = y - linear_model(thetas, X)
plt.scatter(np.arange(len(resid)), resid, s=15)
plt.title('Residuals (actual MPG - predicted MPG)')
plt.xlabel('Index of row in data')
plt.ylabel('MPG');
It looks like our model makes reasonable predictions for many car models, although there are some predictions that were off by over 10 MPG (some cars had under 10 MPG!). Perhaps we are more interested in the percent error between the predicted MPG values and the actual MPG values:
resid_prop = resid / with_predictions['mpg']
plt.scatter(np.arange(len(resid_prop)), resid_prop, s=15)
plt.title('Residual proportions (resid / actual MPG)')
plt.xlabel('Index of row in data')
plt.ylabel('Error proportion');
It looks like our model's predictions are usually within 20% away from the actual MPG values.
Notice that in our example thus far, our $ \textbf{X} $ matrix has four columns: one column of all ones, the horsepower, the weight, and the model year. However, the model allows us to handle an arbitrary number of columns:

$$ f_\theta (\textbf{x}) = \theta \cdot \textbf{x} = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_p x_p $$

As we include more columns in our data matrix, we extend $ \theta $ so that it has one parameter for each column in $ \textbf{X} $. Instead of only selecting three numerical columns for prediction, why not use all seven of them?
# HIDDEN
cols = ['bias', 'cylinders', 'displacement', 'horsepower',
        'weight', 'acceleration', 'model year', 'origin']
X = mpg_mat[cols].to_numpy()
mpg_mat[cols]
%%time
thetas_all = minimize(mse_loss, grad_mse_loss, X, y, progress=10)
print(f'theta: {thetas_all} | loss: {mse_loss(thetas_all, X, y):.2f}')
According to gradient descent, our linear model is again $ y = \hat{\theta} \cdot \textbf{x} $, now using the seven-column $ \hat{\theta} $ printed above.
We see that our loss has decreased from 11.6 with three columns of our dataset to 10.85 when using all seven numerical columns of our dataset. We display the proportion error plots for both old and new predictions below:
# HIDDEN
resid_prop_all = (y - linear_model(thetas_all, X)) / with_predictions['mpg']
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.scatter(np.arange(len(resid_prop)), resid_prop, s=15)
plt.title('Residual proportions using 3 columns')
plt.xlabel('Index of row in data')
plt.ylabel('Error proportion')
plt.ylim(-0.7, 0.7)
plt.subplot(122)
plt.scatter(np.arange(len(resid_prop_all)), resid_prop_all, s=15)
plt.title('Residual proportions using 7 columns')
plt.xlabel('Index of row in data')
plt.ylabel('Error proportion')
plt.ylim(-0.7, 0.7)
plt.tight_layout();
Although the difference is slight, you can see that the errors are a bit lower when using seven columns compared to using three. Both models are much better than using a constant model, as the below plot shows:
# HIDDEN
constant_resid_prop = (y - with_predictions['mpg'].mean()) / with_predictions['mpg']
plt.scatter(np.arange(len(constant_resid_prop)), constant_resid_prop, s=15)
plt.title('Residual proportions using constant model')
plt.xlabel('Index of row in data')
plt.ylabel('Error proportion')
plt.ylim(-1, 1);
Using a constant model results in over 75% error for many car MPGs!
We have introduced the linear model for regression. Unlike the constant model, the linear regression model takes features of our data into account when making predictions, making it much more useful whenever we have correlations between variables of our data.

The procedure of fitting a model to data should now be quite familiar:

1. Select a model.
2. Select a loss function.
3. Fit the model by minimizing the loss on our dataset.
It is useful to know that we can usually tweak one of the components without changing the others. In this section, we introduced the linear model without changing our loss function or using a different minimization algorithm. Although modeling can get complicated, it is usually easier to learn by focusing on one component at a time, then combining different parts together as needed in practice.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/13'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
Recall that we found the optimal coefficients for linear models by optimizing their loss functions with gradient descent. We also mentioned that least squares linear regression can be solved analytically. While gradient descent is practical, this geometric perspective will provide a deeper understanding of linear regression.
A Vector Space Review is included in the Appendix. We will assume familiarity with vector arithmetic, the 1-vector, span of a collection of vectors, and projections.
We've been tasked with finding a good linear model for the data:
| x | y |
|---|---|
| 3 | 2 |
| 0 | 1 |
| -1 | -2 |
# HIDDEN
data = pd.DataFrame(
    [
        [3, 2],
        [0, 1],
        [-1, -2],
    ],
    columns=['x', 'y']
)
sns.regplot(x='x', y='y', data=data, ci=None, fit_reg=False);
Assume that the best model is one with the least error, and that the least squares error is an acceptable measure.
Like we did with the tips dataset, let's start with the constant model: the model that only ever predicts a single number.
Thus, we are working with just the $ y $-values.
| y |
|---|
| 2 |
| 1 |
| -2 |
Our goal is to find the $ \theta $ that results in the line that minimizes the squared loss:

$$ L(\theta, \textbf{y}) = \sum_{i = 1}^{n} (y_i - \theta)^2 $$

Recall that for the constant model, the minimizing $ \theta $ for MSE is $ \bar{y} $, the average of the $ y $ values. The calculus derivation can be found in the Loss Functions lesson in the Modeling and Estimations chapter. For the linear algebra derivation, please refer to the Vector Space Review in the Appendix.
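As a quick numerical check of this fact, we can compute the loss at the mean of the $ y $-values from our table and at a few nearby values of $ \theta $; the mean should always give the smallest loss:

```python
import numpy as np

y = np.array([2.0, 1.0, -2.0])  # the y-values from the table above

def constant_loss(theta, y):
    '''Squared loss of the constant model that always predicts theta.'''
    return np.sum((y - theta) ** 2)

theta_hat = np.mean(y)  # 1/3 for this data
loss_at_mean = constant_loss(theta_hat, y)
```

Because the loss is a strictly convex function of $ \theta $, moving away from the mean in either direction can only increase it.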
Notice that our loss function is a sum of squares. The L2-norm for a vector is also a sum of squares, but with a square root:

$$ \| \textbf{v} \| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} $$

If we let $ v_i = y_i - \theta $:

$$ L(\theta, \textbf{y}) = \| \textbf{v} \|^2 $$

This means our loss can be expressed as the L2-norm of some vector $ \textbf{v} $, squared. We can express $ \textbf{v} $ as $ \textbf{y} - \theta \textbf{1} $ so that in Cartesian notation,

$$ \textbf{v} = \begin{bmatrix} y_1 - \theta \\ y_2 - \theta \\ \vdots \\ y_n - \theta \end{bmatrix} $$

So our loss function can be written as:

$$ L(\theta, \textbf{y}) = \| \textbf{y} - \theta \textbf{1} \|^2 $$

The expression $ \theta \textbf{1} $ is a scalar multiple of the $ \textbf{1} $ vector, and is the result of our predictions, denoted $ \hat{\textbf{y}} $.
This gives us a new perspective on what it means to minimize the least squares error.
$ \textbf{y} $ and $ \textbf{1} $ are fixed, but $ \theta $ can take on any value, so $ \hat{\textbf{y}} $ can be any scalar multiple of $ \textbf{1} $. We want to find $ \theta $ so that $ \hat{\textbf{y}} $ is as close to $ \textbf{y} $ as possible. We use $ \hat{\theta} $ to denote this best-fit $ \theta $.

The projection of $ \textbf{y} $ onto $ \textbf{1} $ is guaranteed to be the closest vector (see "Vector Space Review" in the Appendix).
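We can verify this claim numerically: projecting $ \textbf{y} $ onto $ \textbf{1} $ with the projection formula $ \frac{\textbf{y} \cdot \textbf{1}}{\textbf{1} \cdot \textbf{1}} \textbf{1} $ recovers exactly the mean of the $ y $-values in every component. A sketch:

```python
import numpy as np

y = np.array([2.0, 1.0, -2.0])  # the y-values from the table above
ones = np.ones(3)

# Projection of y onto the 1-vector: (y . 1 / 1 . 1) * 1
y_hat = (y @ ones) / (ones @ ones) * ones
```

This ties the calculus result (the mean minimizes MSE) to the geometric one (the projection is the closest point).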
Now, let's look at the simple linear regression model. This is strongly parallel to the constant model derivation, but be mindful of the differences and think about how you might generalize to multiple linear regression.
The simple linear model is:

$$ f_\theta (x) = \theta_0 + \theta_1 x $$

Our goal is to find the $ \theta = [\theta_0, \theta_1] $ that results in the line with the least squared error:

$$ L(\theta, \textbf{x}, \textbf{y}) = \sum_{i = 1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2 $$

To help us visualize the translation of our loss summation into matrix form, let's expand out the loss with $ n = 3 $:

$$ L(\theta, \textbf{x}, \textbf{y}) = (y_1 - \theta_0 - \theta_1 x_1)^2 + (y_2 - \theta_0 - \theta_1 x_2)^2 + (y_3 - \theta_0 - \theta_1 x_3)^2 $$

Again, our loss function is a sum of squares and the L2-norm for a vector is the square root of a sum of squares:

$$ \| \textbf{v} \| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} $$

If we let $ v_i = y_i - \theta_0 - \theta_1 x_i $:

$$ L(\theta, \textbf{x}, \textbf{y}) = \| \textbf{v} \|^2 $$

As before, our loss can be expressed as the L2-norm of some vector $ \textbf{v} $, squared. With each component written out:

$$ L(\theta, \textbf{x}, \textbf{y}) = \left\| \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} - \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} \right\|^2 = \| \textbf{y} - \textbf{X} \theta \|^2 $$

The matrix multiplication $ \textbf{X} \theta $ is a linear combination of the columns of $ \textbf{X} $: each $ \theta_j $ only ever multiplies with one column of $ \textbf{X} $. This perspective shows us that $ f_\theta $ is a linear combination of the features of our data.

$ \textbf{X} $ and $ \textbf{y} $ are fixed, but $ \theta_0 $ and $ \theta_1 $ can take on any value, so $ \hat{\textbf{y}} $ can take on any of the infinite linear combinations of the columns of $ \textbf{X} $. To have the smallest loss, we want to choose $ \theta $ such that $ \hat{\textbf{y}} $ is as close to $ \textbf{y} $ as possible, denoted as $ \text{proj}_{\text{span}(\textbf{X})} \textbf{y} $.

Now, let's develop an intuition for why it matters that $ \hat{\textbf{y}} $ is restricted to the linear combinations of the columns of $ \textbf{X} $. Although the span of any set of vectors includes an infinite number of linear combinations, infinite does not mean any; the linear combinations are restricted by the basis vectors.
As a reminder, here is our loss function and scatter plot:
# HIDDEN
sns.regplot(x='x', y='y', data=data, ci=None, fit_reg=False);
By inspecting our scatter plot, we see that no line can perfectly fit our points, so we cannot achieve 0 loss. Thus, we know that $ \textbf{y} $ is not in the plane spanned by $ \textbf{x} $ and $ \textbf{1} $, represented as a parallelogram below.

Since our loss is distance-based, we can see that to minimize $ L(\theta, \textbf{x}, \textbf{y}) = \| \textbf{y} - \textbf{X} \theta \|^2 $, we want $ \textbf{X} \theta $ to be as close to $ \textbf{y} $ as possible.

Mathematically, we are looking for the projection of $ \textbf{y} $ onto the vector space spanned by the columns of $ \textbf{X} $, as the projection of any vector $ \textbf{y} $ is the closest point in $ \text{span}(\textbf{X}) $ to that vector. Thus, choosing $ \theta $ such that $ \hat{\textbf{y}} = \textbf{X} \theta = \text{proj}_{\text{span}(\textbf{X})} \textbf{y} $ is the best solution.

To see why, consider other points on the vector space, in purple.

By the Pythagorean Theorem, any other point on the plane is farther from $ \textbf{y} $ than $ \hat{\textbf{y}} $ is. The length of the perpendicular corresponding to $ \| \textbf{y} - \hat{\textbf{y}} \| $ represents the least squared error.

Since we've snuck in a lot of linear algebra concepts already, all that's left is solving for the $ \hat{\theta} $ that yields our desired $ \hat{\textbf{y}} $.
A couple things to note:

Thus, writing the error vector as $ \textbf{e} = \textbf{y} - \hat{\textbf{y}} $, we arrive at the equation:

$$ \textbf{X} \hat{\theta} + \textbf{e} = \textbf{y} $$

Left-multiplying both sides by $ \textbf{X}^T $:

$$ \textbf{X}^T \textbf{X} \hat{\theta} + \textbf{X}^T \textbf{e} = \textbf{X}^T \textbf{y} $$

Since $ \textbf{e} $ is perpendicular to the columns of $ \textbf{X} $, $ \textbf{X}^T \textbf{e} $ is a column vector of $ 0 $'s. Thus, we arrive at the Normal Equation:

$$ \textbf{X}^T \textbf{X} \hat{\theta} = \textbf{X}^T \textbf{y} $$

From here, we can easily solve for $ \hat{\theta} $ by left-multiplying both sides by $ (\textbf{X}^T \textbf{X})^{-1} $:

$$ \hat{\theta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} $$

Note: we can get this same solution by minimizing $ \| \textbf{y} - \textbf{X} \theta \|^2 $ with vector calculus, but in the case of least squares loss, vector calculus isn't necessary. For other loss functions, we will need to use vector calculus to get the analytic solution.
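As a sketch, we can solve the Normal Equation on the three data points from this lesson and confirm that the resulting error vector is perpendicular to every column of $ \textbf{X} $. In practice, `np.linalg.solve` is preferred over explicitly forming the inverse:

```python
import numpy as np

# Design matrix for the three data points above: a bias column and x
X = np.array([[1.0, 3.0],
              [1.0, 0.0],
              [1.0, -1.0]])
y = np.array([2.0, 1.0, -2.0])

# Solve the Normal Equation X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The error e = y - X theta_hat should be perpendicular to every
# column of X, so X^T e should be (numerically) zero.
e = y - X @ theta_hat
```

Working the algebra by hand for this data gives $ \hat{\theta} = [-\frac{3}{13}, \frac{11}{13}] $, which the solver reproduces.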
Let's return to our case study, apply what we've learned, and explain why our solution is sound.
We have analytically found that the best model for least squares linear regression is $ \hat{\theta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} $. We know that our choice of $ \hat{\theta} $ is sound by the mathematical property that the projection of $ \textbf{y} $ onto the span of the columns of $ \textbf{X} $ yields the closest point in the vector space to $ \textbf{y} $. Under linear constraints using the least squares loss, solving for $ \hat{\theta} $ by taking the projection guarantees us the optimal solution.

For every additional variable, we add one column to $ \textbf{X} $. The span of the columns of $ \textbf{X} $ is the set of linear combinations of the column vectors, so adding a column only changes the span if it is linearly independent from all existing columns.

When the added column is linearly dependent, it can be expressed as a linear combination of some other columns, and thus will not introduce any new vectors to the subspace.

Recall that the span of $ \textbf{X} $ is important because it is the subspace we want to project $ \textbf{y} $ onto. If the subspace does not change, then the projection will not change.

For example, when we introduced $ \textbf{x} $ to the constant model to get the simple linear model, we introduced an independent variable. $ \textbf{x} = [3, 0, -1] $ cannot be expressed as a scalar multiple of $ \textbf{1} $. Thus, we moved from finding the projection of $ \textbf{y} $ onto a line:

to finding the projection of onto a plane:

Now, let's introduce another variable, $ \textbf{z} $, and explicitly write out the bias column:
| z | 1 | x | y |
|---|---|---|---|
| 4 | 1 | 3 | 2 |
| 1 | 1 | 0 | 1 |
| 0 | 1 | -1 | -2 |
Notice that $ \textbf{z} = \textbf{1} + \textbf{x} $. Since $ \textbf{z} $ is a linear combination of $ \textbf{1} $ and $ \textbf{x} $, it lies in the original $ \text{span}(\textbf{X}) $. Formally, $ \textbf{z} $ is linearly dependent on $ \{ \textbf{1}, \textbf{x} \} $ and does not change $ \text{span}(\textbf{X}) $. Thus, the projection of $ \textbf{y} $ onto the subspace spanned by $ \textbf{1} $, $ \textbf{x} $, and $ \textbf{z} $ would be the same as the projection of $ \textbf{y} $ onto the subspace spanned by $ \textbf{1} $ and $ \textbf{x} $.

We can also observe this from minimizing the loss function:

$$ L(\theta, \textbf{X}, \textbf{y}) = \| \textbf{y} - (\theta_0 \textbf{1} + \theta_1 \textbf{x} + \theta_2 \textbf{z}) \|^2 $$

Our possible solutions follow the form $ \theta_0 \textbf{1} + \theta_1 \textbf{x} + \theta_2 \textbf{z} $.

Since $ \textbf{z} = \textbf{1} + \textbf{x} $, regardless of $ \theta_0 $, $ \theta_1 $, and $ \theta_2 $, the possible values can be rewritten as:

$$ \theta_0 \textbf{1} + \theta_1 \textbf{x} + \theta_2 (\textbf{1} + \textbf{x}) = (\theta_0 + \theta_2) \textbf{1} + (\theta_1 + \theta_2) \textbf{x} $$

So adding $ \textbf{z} $ does not change the problem at all. The only difference is that we can express the same projection in multiple ways. Recall that we found the projection of $ \textbf{y} $ onto the plane spanned by $ \textbf{1} $ and $ \textbf{x} $ to be:

$$ \hat{\textbf{y}} = \hat{\theta_0} \textbf{1} + \hat{\theta_1} \textbf{x} $$

However, with the introduction of $ \textbf{z} $, we have more ways to express this same projection vector.

Since $ \textbf{1} = \textbf{z} - \textbf{x} $, $ \hat{\textbf{y}} $ can also be expressed as:

$$ \hat{\textbf{y}} = \hat{\theta_0} (\textbf{z} - \textbf{x}) + \hat{\theta_1} \textbf{x} = \hat{\theta_0} \textbf{z} + (\hat{\theta_1} - \hat{\theta_0}) \textbf{x} $$

Since $ \textbf{x} = \textbf{z} - \textbf{1} $, $ \hat{\textbf{y}} $ can also be expressed as:

$$ \hat{\textbf{y}} = \hat{\theta_0} \textbf{1} + \hat{\theta_1} (\textbf{z} - \textbf{1}) = (\hat{\theta_0} - \hat{\theta_1}) \textbf{1} + \hat{\theta_1} \textbf{z} $$

But all three expressions represent the same projection.

In conclusion, adding a linearly dependent column to $ \textbf{X} $ does not change $ \text{span}(\textbf{X}) $, and thus will not change the projection or the solution to the least squares problem.
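We can confirm this conclusion numerically: adding the dependent column $ \textbf{z} = \textbf{1} + \textbf{x} $ changes the fitted parameters but not the fitted values. This sketch uses `np.linalg.lstsq`, which handles the resulting rank-deficient matrix gracefully:

```python
import numpy as np

# Design matrix for the three data points: bias column and x
X = np.array([[1.0, 3.0],
              [1.0, 0.0],
              [1.0, -1.0]])
y = np.array([2.0, 1.0, -2.0])

# Add the linearly dependent column z = 1 + x
z = X[:, 0] + X[:, 1]
X_with_z = np.column_stack([X, z])

# Fit both versions; lstsq returns a least squares solution even
# when the matrix does not have full column rank
theta_2, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_3, *_ = np.linalg.lstsq(X_with_z, y, rcond=None)

# The parameter vectors differ in length, but the projections
# (the predictions) are identical
pred_2 = X @ theta_2
pred_3 = X_with_z @ theta_3
```

Both prediction vectors equal the projection of $ \textbf{y} $ onto the same plane.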
We included the scatter plots twice in this lesson. The first reminded us that, like before, we are finding the best-fit line for the data. The second showed that no single line could fit all points. Apart from these two occurrences, we tried not to disrupt our vector space drawings with scatter plots. This is because scatter plots correspond with the row-space perspective of the least squares problem: looking at each data point and trying to minimize the distance between our predictions and each datum. In this lesson, we looked at the column-space perspective: each feature was a vector, constructing a space of possible solutions (projections).
Both perspectives are valid and helpful to understand, and we hope you had fun seeing both sides of the least squares problem!
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/13'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
from scipy.optimize import minimize as sci_min
def minimize(cost_fn, grad_cost_fn, X, y, progress=True):
    '''
    Uses scipy.minimize to minimize cost_fn using a form of gradient descent.
    '''
    theta = np.zeros(X.shape[1])
    iters = 0

    def objective(theta):
        return cost_fn(theta, X, y)

    def gradient(theta):
        return grad_cost_fn(theta, X, y)

    def print_theta(theta):
        nonlocal iters
        if progress and iters % progress == 0:
            print(f'theta: {theta} | cost: {cost_fn(theta, X, y):.2f}')
        iters += 1

    print_theta(theta)
    return sci_min(
        objective, theta, method='BFGS', jac=gradient, callback=print_theta,
        tol=1e-7
    ).x
In this section, we perform an end-to-end case study of applying the linear regression model to a dataset. The dataset we will be working with has various attributes, such as length and girth, of donkeys.
Our task is to predict a donkey's weight using linear regression.
We will begin by reading in the dataset and taking a quick peek at its contents.
donkeys = pd.read_csv("donkeys.csv")
donkeys.head()
It's always a good idea to look at how much data we have by looking at the dimensions of the dataset. If we have a large number of observations, printing out the entire dataframe may crash our notebook.
donkeys.shape
The dataset is relatively small, with only 544 rows of observations and 8 columns. Let's look at what columns are available to us.
donkeys.columns.values
A good understanding of our data can guide our analysis, so we should understand what each of these columns represent. A few of these columns are self-explanatory, but others require a little more explanation:
- BCS: Body Condition Score (a physical health rating)
- Girth: the measurement around the middle of the donkey
- WeightAlt: the second weighing (31 donkeys in our data were weighed twice in order to check the accuracy of the scale)

It is also a good idea to determine which variables are quantitative and which are categorical.
Quantitative: Length, Girth, Height, Weight, WeightAlt
Categorical: BCS, Age, Sex
In this section, we will check the data for any abnormalities that we have to deal with.
By examining WeightAlt more closely, we can make sure that the scale is accurate by taking the difference between the two different weighings and plotting them.
difference = donkeys['WeightAlt'] - donkeys['Weight']
sns.distplot(difference.dropna());
The measurements are all within 1 kg of each other, which seems reasonable.
Next, we can look for unusual values that might indicate errors or other problems. We can use the quantile function in order to detect anomalous values.
donkeys.quantile([0.005, 0.995])
For each of these numerical columns, we can look at which rows fall outside of these quantiles and what values they take on. Consider that we want our model to apply to only healthy and mature donkeys.
First, let's look at the BCS column.
donkeys[(donkeys['BCS'] < 1.5) | (donkeys['BCS'] > 4)]['BCS']
Also looking at the barplot of BCS:
plt.hist(donkeys['BCS'], density=True)
plt.xlabel('BCS');
Considering that BCS is an indication of the health of a donkey, a BCS of 1 represents an extremely emaciated donkey and a BCS of 4.5 an overweight donkey. Also looking at the barplot, there only appear to be two donkeys with such outlying BCS values. Thus, we remove these two donkeys.
Now, let's look at Length, Girth, and Height.
donkeys[(donkeys['Length'] < 71.145) | (donkeys['Length'] > 111)]['Length']
donkeys[(donkeys['Girth'] < 90) | (donkeys['Girth'] > 131.285)]['Girth']
donkeys[(donkeys['Height'] < 89) | (donkeys['Height'] > 112)]['Height']
For these three columns, the donkey in row 8 seems to have a much smaller value than the cut-off while the other anomalous donkeys are close to the cut-off and likely do not need to be removed.
Finally, let's take a look at Weight.
donkeys[(donkeys['Weight'] < 71.715) | (donkeys['Weight'] > 214)]['Weight']
The first 2 and last 2 donkeys in the list are far off from the cut-off and most likely should be removed. The middle donkey can be included.
Since WeightAlt closely corresponds to Weight, we skip checking this column for anomalies. Summarizing what we have learned, here is how we want to filter our donkeys:
- BCS between 1.5 and 4
- Weight between 71 and 214

donkeys_c = donkeys[(donkeys['BCS'] >= 1.5) & (donkeys['BCS'] <= 4) &
                    (donkeys['Weight'] >= 71) & (donkeys['Weight'] <= 214)]
Before we proceed with our data analysis, we divide our data into an 80/20 split, using 80% of our data to train our model and setting aside the other 20% for evaluation of the model.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    donkeys_c.drop(['Weight'], axis=1),
    donkeys_c['Weight'],
    test_size=0.2,
    random_state=42)
X_train.shape, X_test.shape
Let's also create a function that evaluates our predictions on the test set. Let's use mean squared error.
def mse_test_set(predictions):
    '''Returns the sum of squared errors of predictions on the test set.'''
    return float(np.sum((predictions - y_test) ** 2))
As usual, we will explore our data before attempting to fit a model to it.
First, we will examine the categorical variables with boxplots.
# HIDDEN
sns.boxplot(x=X_train['BCS'], y=y_train);
It seems like median weight increases with BCS, but not linearly.
# HIDDEN
sns.boxplot(x=X_train['Sex'], y=y_train,
order = ['female', 'stallion', 'gelding']);
The sex of the donkey doesn't appear to cause much of a difference in weight.
# HIDDEN
sns.boxplot(x=X_train['Age'], y=y_train,
order = ['<2', '2-5', '5-10', '10-15', '15-20', '>20']);
For donkeys over 5, the weight distribution is not too different.
Now, let's look at the quantitative variables. We can plot each of them against the target variable.
# HIDDEN
X_train['Weight'] = y_train
sns.regplot(x='Length', y='Weight', data=X_train, fit_reg=False);
# HIDDEN
sns.regplot(x='Girth', y='Weight', data=X_train, fit_reg=False);
# HIDDEN
sns.regplot(x='Height', y='Weight', data=X_train, fit_reg=False);
All three of our quantitative features have a linear relationship with our target variable of Weight, so we will not have to perform any transformations on our input data.
It is also a good idea to see if our features are linear with each other. We plot two below:
# HIDDEN
sns.regplot(x='Height', y='Length', data=X_train, fit_reg=False);
# HIDDEN
sns.regplot(x='Height', y='Girth', data=X_train, fit_reg=False);
From these plots, we can see that our predictor variables also have strong linear relationships with each other. This makes our model harder to interpret, so we should keep this in mind after we create our model.
Rather than using all of our data at once, let's try to fit linear models to one or two variables first.
Below are three simple linear regression models using just one quantitative variable. Which model appears to be the best?
# HIDDEN
sns.regplot(x='Length', y='Weight', data=X_train, fit_reg=True);
# HIDDEN
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train[['Length']], X_train['Weight'])
predictions = model.predict(X_test[['Length']])
print("MSE:", mse_test_set(predictions))
sns.regplot(x='Girth', y='Weight', data=X_train, fit_reg=True);
# HIDDEN
model = LinearRegression()
model.fit(X_train[['Girth']], X_train['Weight'])
predictions = model.predict(X_test[['Girth']])
print("MSE:", mse_test_set(predictions))
sns.regplot(x='Height', y='Weight', data=X_train, fit_reg=True);
# HIDDEN
model = LinearRegression()
model.fit(X_train[['Height']], X_train['Weight'])
predictions = model.predict(X_test[['Height']])
print("MSE:", mse_test_set(predictions))
Looking at the scatterplots and the mean squared errors, it seems like Girth is the best sole predictor of Weight as it has the strongest linear relationship with Weight and the smallest mean squared error.
Can we do better with two variables? Let's try fitting a linear model using both Girth and Length. Although it is not as easy to visualize this model, we can still look at the MSE of this model.
# HIDDEN
model = LinearRegression()
model.fit(X_train[['Girth', 'Length']], X_train['Weight'])
predictions = model.predict(X_test[['Girth', 'Length']])
print("MSE:", mse_test_set(predictions))
Wow! Looks like our MSE went down from around 13000 with Girth alone to 10000 with Girth and Length. Including the second variable improved our model.
We can also use categorical variables in our model. Let's now look at a linear model using the categorical variable of Age. This is the plot of Age versus Weight:
# HIDDEN
sns.stripplot(x='Age', y='Weight', data=X_train, order=['<2', '2-5', '5-10', '10-15', '15-20', '>20']);
Since Age is a categorical variable, we need to introduce dummy variables in order to produce a linear regression model.
# HIDDEN
just_age_and_weight = X_train[['Age', 'Weight']]
with_age_dummies = pd.get_dummies(just_age_and_weight, columns=['Age'])
model = LinearRegression()
model.fit(with_age_dummies.drop('Weight', axis=1), with_age_dummies['Weight'])
just_age_and_weight_test = X_test[['Age']]
with_age_dummies_test = pd.get_dummies(just_age_and_weight_test, columns=['Age'])
predictions = model.predict(with_age_dummies_test)
print("MSE:", mse_test_set(predictions))
An MSE of around 40000 is worse than what we could get using any single one of the quantitative variables, but this variable could still prove to be useful in our linear model.
Let's try to interpret this linear model. Note that every donkey that falls into an age category, say 2-5 years of age, will receive the same prediction because they share the input values: a 1 in the column corresponding to 2-5 years of age, and 0 in all other columns. Thus, we can interpret categorical variables as simply changing the constant in the model because the categorical variable separates the donkeys into groups and gives one prediction for all donkeys within that group.
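A minimal sketch (with made-up weights, not the donkey data) confirms this interpretation: a regression on dummy variables alone predicts each group's mean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical weights for donkeys in two age groups
df = pd.DataFrame({
    'Age':    ['<2', '<2', '2-5', '2-5', '2-5'],
    'Weight': [80,   100,  150,   160,   170],
})

dummies = pd.get_dummies(df, columns=['Age'])
model = LinearRegression()
model.fit(dummies.drop('Weight', axis=1), dummies['Weight'])
preds = model.predict(dummies.drop('Weight', axis=1))

# Every donkey in a group receives its group's mean weight:
# approximately [90, 90, 160, 160, 160]
print(preds)
```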
Our next step is to create a final model using both our categorical variables and multiple quantitative variables.
Recall from our boxplots that Sex was not a useful variable, so we will drop it. We will also remove the WeightAlt column because we only have its value for 31 donkeys. Finally, using get_dummies, we transform the categorical variables BCS and Age into dummy variables so that we can include them in the model.
# HIDDEN
X_train.drop('Weight', axis=1, inplace=True)
# HIDDEN
pd.set_option('display.max_columns', 15)
X_train.drop(['Sex', 'WeightAlt'], axis=1, inplace=True)
X_train = pd.get_dummies(X_train, columns=['BCS', 'Age'])
X_train.head()
Recall that we noticed that the weight distribution of donkeys over the age of 5 is not very different. Thus, let's combine the columns Age_10-15, Age_15-20, and Age_>20 into one column.
age_over_10 = X_train['Age_10-15'] | X_train['Age_15-20'] | X_train['Age_>20']
X_train['Age_>10'] = age_over_10
X_train.drop(['Age_10-15', 'Age_15-20', 'Age_>20'], axis=1, inplace=True)
Since we do not want our matrix to be over-parameterized, we should drop one category from the BCS and Age dummies.
X_train.drop(['BCS_3.0', 'Age_5-10'], axis=1, inplace=True)
X_train.head()
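As an aside, pandas can drop a baseline category for us: `get_dummies` accepts `drop_first=True`, which removes the first category rather than a hand-picked baseline like the `BCS_3.0` and `Age_5-10` columns we drop here. A small sketch with hypothetical BCS values:

```python
import pandas as pd

df = pd.DataFrame({'BCS': [2.5, 3.0, 3.0, 3.5]})

# Keeping every category alongside a bias column makes the columns linearly
# dependent (the dummies sum to the bias column); dropping one category fixes this.
full = pd.get_dummies(df, columns=['BCS'])
reduced = pd.get_dummies(df, columns=['BCS'], drop_first=True)

print(list(full.columns))     # ['BCS_2.5', 'BCS_3.0', 'BCS_3.5']
print(list(reduced.columns))  # ['BCS_3.0', 'BCS_3.5']
```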
We should also add a bias column of all ones in order to have a constant term in our model.
X_train = X_train.assign(bias=1)
# HIDDEN
X_train = X_train.reindex(columns=['bias'] + list(X_train.columns[:-1]))
X_train.head()
We are finally ready to fit our model to all of the variables we have deemed important and transformed into the proper form.
Our model looks like this:
$$ f_\hat{\theta} (\textbf{x}) = \hat{\theta} \cdot \textbf{x} $$
where $\textbf{x}$ is a row of our data matrix, containing the bias, the quantitative features, and the dummy variables.
Here are the functions we defined in the multiple linear regression lesson, which we will use again:
def linear_model(thetas, X):
'''Returns predictions by a linear model on x_vals.'''
return X @ thetas
def mse_cost(thetas, X, y):
return np.mean((y - linear_model(thetas, X)) ** 2)
def grad_mse_cost(thetas, X, y):
n = len(X)
return -2 / n * (X.T @ y - X.T @ X @ thetas)
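As a quick sanity check, we can compare `grad_mse_cost` against a finite-difference approximation on synthetic data (not the donkey measurements); the definitions are repeated here so the sketch runs on its own:

```python
import numpy as np

def linear_model(thetas, X):
    '''Returns predictions by a linear model on x_vals.'''
    return X @ thetas

def mse_cost(thetas, X, y):
    return np.mean((y - linear_model(thetas, X)) ** 2)

def grad_mse_cost(thetas, X, y):
    n = len(X)
    return -2 / n * (X.T @ y - X.T @ X @ thetas)

# Synthetic data for the check
np.random.seed(0)
X = np.random.randn(50, 3)
y = np.random.randn(50)
thetas = np.random.randn(3)

# Central differences approximate each partial derivative of the cost
eps = 1e-6
approx = np.array([
    (mse_cost(thetas + eps * e, X, y) - mse_cost(thetas - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_mse_cost(thetas, X, y), approx, atol=1e-4))  # True
```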
In order to use the above functions, we need X and y. These can both be obtained from our data frames. Remember that X and y have to be NumPy arrays in order to multiply them with the @ notation.
X_train = X_train.values
y_train = y_train.values
Now we just need to call the minimize function defined in a previous section.
thetas = minimize(mse_cost, grad_mse_cost, X_train, y_train)
Our linear model is:
Let's compare this equation that we obtained to the one we would get if we had used sklearn's LinearRegression model instead.
model = LinearRegression(fit_intercept=False) # We already accounted for it with the bias column
model.fit(X_train[:, :14], y_train)
print("Coefficients", model.coef_)
The coefficients look exactly the same! Our homemade functions create the same model as an established Python package!
We successfully fit a linear model to our donkey data! Nice!
Our next step is to evaluate our model's performance on the test set. We need to perform the same data pre-processing steps on the test set as we did on the training set before we can pass it into our model.
X_test.drop(['Sex', 'WeightAlt'], axis=1, inplace=True)
X_test = pd.get_dummies(X_test, columns=['BCS', 'Age'])
age_over_10 = X_test['Age_10-15'] | X_test['Age_15-20'] | X_test['Age_>20']
X_test['Age_>10'] = age_over_10
X_test.drop(['Age_10-15', 'Age_15-20', 'Age_>20'], axis=1, inplace=True)
X_test.drop(['BCS_3.0', 'Age_5-10'], axis=1, inplace=True)
X_test = X_test.assign(bias=1)
# HIDDEN
X_test = X_test.reindex(columns=['bias'] + list(X_test.columns[:-1]))
X_test
We pass X_test into the predict method of our LinearRegression model:
X_test = X_test.values
predictions = model.predict(X_test)
Let's look at the mean squared error:
mse_test_set(predictions)
With these predictions, we can also make a residual plot:
# HIDDEN
y_test = y_test.values
resid = y_test - predictions
resid_prop = resid / y_test
plt.scatter(np.arange(len(resid_prop)), resid_prop, s=15)
plt.axhline(0)
plt.title('Residual proportions (resid / actual Weight)')
plt.xlabel('Index of row in data')
plt.ylabel('Error proportion');
Looks like our model does pretty well! The residual proportions indicate that our predictions are mostly within 15% of the correct value.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/14'))
Feature engineering refers to the practice of creating and adding new features to the dataset itself in order to add complexity to our models.
So far we have only conducted linear regression using numerical features as the input—we used the (numeric) total bill in order to predict the tip amount. However, the tip dataset also contained categorical data, such as the day of week and the meal type. Feature engineering allows us to convert categorical variables into numerical features for linear regression.
Feature engineering also allows us to use our linear regression model to conduct polynomial regression by creating new variables in our dataset.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/14'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
In 2014, Walmart released some of its sales data as part of a competition to predict the weekly sales of its stores. We've taken a subset of their data and loaded it below.
walmart = pd.read_csv('walmart.csv')
walmart
The data contains several interesting features, including whether a week contained a holiday (IsHoliday), the unemployment rate that week (Unemployment), and which special deals the store offered that week (MarkDown).
Our goal is to create a model that predicts the Weekly_Sales variable using the other variables in our data. With a linear regression model, we can directly use the Temperature, Fuel_Price, and Unemployment columns because they contain numerical data.
In previous sections we have seen how to take the gradient of the cost function and use gradient descent to fit a model. To do this, we had to define Python functions for our model, the cost function, the gradient of the cost function, and the gradient descent algorithm. While this was important to demonstrate how the concepts work, in this section we will instead use a machine learning library called scikit-learn which allows us to fit a model with less code.
For example, to fit a multiple linear regression model using the numerical columns in the Walmart dataset, we first create a two-dimensional NumPy array containing the variables used for prediction and a one-dimensional array containing the values we want to predict:
numerical_columns = ['Temperature', 'Fuel_Price', 'Unemployment']
X = walmart[numerical_columns].values
X
y = walmart['Weekly_Sales'].values
y
Then, we import the LinearRegression class from scikit-learn (docs), instantiate it, and call the fit method using X to predict y.
Note that previously we had to manually add a column of all ones to the X matrix in order to conduct linear regression with an intercept. This time, scikit-learn will take care of the intercept column behind the scenes, saving us some work.
from sklearn.linear_model import LinearRegression
simple_classifier = LinearRegression()
simple_classifier.fit(X, y)
We are done! When we called .fit, scikit-learn found the linear regression parameters that minimized the least squares cost function. We can see the parameters below:
simple_classifier.coef_, simple_classifier.intercept_
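We can verify scikit-learn's intercept handling on synthetic data: fitting with the default `fit_intercept=True` is equivalent to adding the column of ones ourselves and turning the intercept off.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the Walmart columns
np.random.seed(0)
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + np.random.randn(100) * 0.1

# Option 1: let scikit-learn handle the intercept
auto = LinearRegression().fit(X, y)

# Option 2: add the column of ones ourselves and turn the intercept off
X_bias = np.hstack([np.ones((100, 1)), X])
manual = LinearRegression(fit_intercept=False).fit(X_bias, y)

print(np.allclose(auto.intercept_, manual.coef_[0]))  # True
print(np.allclose(auto.coef_, manual.coef_[1:]))      # True
```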
To calculate the mean squared cost, we can ask the classifier to make predictions for the input data X and compare the predictions with the actual values y.
predictions = simple_classifier.predict(X)
np.mean((predictions - y) ** 2)
The mean squared error looks quite high. This is likely because our variables (temperature, price of fuel, and unemployment rate) are only weakly correlated with the weekly sales.
There are two more variables in our data that might be more useful for prediction: the IsHoliday column and MarkDown column. The boxplot below shows that holidays may have some relation with the weekly sales.
sns.pointplot(x='IsHoliday', y='Weekly_Sales', data=walmart);
The different markdown categories also seem to correlate with different weekly sales amounts.
markdowns = ['No Markdown', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
plt.figure(figsize=(7, 5))
sns.pointplot(x='Weekly_Sales', y='MarkDown', data=walmart, order=markdowns);
However, both IsHoliday and MarkDown columns contain categorical data, not numerical, so we cannot use them as-is for regression.
Fortunately, we can perform a one-hot encoding transformation on these categorical variables to convert them into numerical variables. The transformation works as follows: create a new column for every unique value in a categorical variable. The column contains a 1 if the variable originally had the corresponding value; otherwise the column contains a 0. For example, the MarkDown column below contains the following values:
# HIDDEN
walmart[['MarkDown']]
This variable contains six different unique values: 'No Markdown', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', and 'MarkDown5'. We create one column for each value to get six columns in total. Then, we fill in the columns with zeros and ones according to the scheme described above.
# HIDDEN
from sklearn.feature_extraction import DictVectorizer
items = walmart[['MarkDown']].to_dict(orient='records')
encoder = DictVectorizer(sparse=False)
pd.DataFrame(
data=encoder.fit_transform(items),
columns=encoder.feature_names_
)
Notice that the first value in the data is "No Markdown", and thus only the last column of the first row in the transformed table is marked with a 1. In addition, the last value in the data is "MarkDown1", which results in the first column of row 142 being marked with a 1.
Each row of the resulting table will contain a single column with a 1; the rest will contain a 0. The name "one-hot" reflects the fact that only one column is "hot" (marked with a 1).
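pandas can also perform this encoding directly with `get_dummies`. A sketch on a made-up version of the MarkDown column shows the defining one-hot property: exactly one column per row is hot.

```python
import pandas as pd

# A small made-up version of the MarkDown column
df = pd.DataFrame({'MarkDown': ['No Markdown', 'MarkDown1', 'MarkDown3']})
encoded = pd.get_dummies(df, columns=['MarkDown'])
print(encoded)

# Exactly one column per row is "hot"
print((encoded.sum(axis=1) == 1).all())  # True
```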
To perform one-hot encoding we can use scikit-learn's DictVectorizer class. To use the class, we have to convert our dataframe into a list of dictionaries. The DictVectorizer class automatically one-hot encodes the categorical data (which needs to be strings) and leaves numerical data untouched.
from sklearn.feature_extraction import DictVectorizer
all_columns = ['Temperature', 'Fuel_Price', 'Unemployment', 'IsHoliday',
'MarkDown']
records = walmart[all_columns].to_dict(orient='records')
encoder = DictVectorizer(sparse=False)
encoded_X = encoder.fit_transform(records)
encoded_X
To get a better sense of the transformed data, we can display it with the column names:
pd.DataFrame(data=encoded_X, columns=encoder.feature_names_)
The numerical variables (fuel price, temperature, and unemployment) are left as numbers. The categorical variables (holidays and markdown) are one-hot encoded. When we use the new matrix of data to fit a linear regression model, we will generate one parameter for each column of the data. Since this data matrix contains eleven columns, the model will have twelve parameters because we fit one extra parameter for the intercept term.
We can now use the encoded_X variable for linear regression.
clf = LinearRegression()
clf.fit(encoded_X, y)
As promised, we have eleven parameters for the columns and one intercept parameter.
clf.coef_, clf.intercept_
We can compare a few of the predictions from both classifiers to see whether there's a large difference between the two.
walmart[['Weekly_Sales']].assign(
pred_numeric=simple_classifier.predict(X),
pred_both=clf.predict(encoded_X)
)
It appears that both models make very similar predictions. A scatter plot of both sets of predictions confirms this.
plt.scatter(simple_classifier.predict(X), clf.predict(encoded_X))
plt.title('Predictions using all data vs. numerical features only')
plt.xlabel('Predictions using numerical features')
plt.ylabel('Predictions using all features');
Why might this be the case? We can examine the parameters that both models learn. The table below shows the weights learned by the classifier that only used numerical variables without one-hot encoding:
# HIDDEN
def clf_params(names, clf):
weights = (
np.append(clf.coef_, clf.intercept_)
)
return pd.DataFrame(weights, names + ['Intercept'])
clf_params(numerical_columns, simple_classifier)
The table below shows the weights learned by the classifier with one-hot encoding.
# HIDDEN
pd.options.display.max_rows = 13
display(clf_params(encoder.feature_names_, clf))
pd.options.display.max_rows = 7
We can see that even when we fit a linear regression model using one-hot encoded columns the weights for fuel price, temperature, and unemployment are very similar to the previous values. All the weights are small in comparison to the intercept term, suggesting that most of the variables are still only slightly correlated with the actual sale amounts. In fact, the model weights for the IsHoliday variable are so low that it makes nearly no difference in prediction whether the date was a holiday or not. Although some of the MarkDown weights are rather large, many markdown events only appear a few times in the dataset.
walmart['MarkDown'].value_counts()
This suggests that we probably need to collect more data in order for the model to better utilize the effects of markdown events on the sale amounts. (In reality, the dataset shown here is a small subset of a much larger dataset released by Walmart. It will be a useful exercise to train a model using the entire dataset instead of a small subset.)
We have learned to use one-hot encoding, a useful technique for conducting linear regression on categorical data. Although in this particular example the transformation didn't affect our model very much, in practice the technique is used widely when working with categorical data. One-hot encoding also illustrates the general principle of feature engineering—it takes an original data matrix and transforms it into a potentially more useful one.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/14'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
# To determine which columns to regress
# ice_orig = pd.read_csv('icecream_orig.csv')
# cols = ['aerated', 'afterfeel', 'almond', 'buttery', 'color', 'cooling',
# 'creamy', 'doughy', 'eggy', 'fat', 'fat_level', 'fatty', 'hardness',
# 'ice_crystals', 'id', 'liking_flavor', 'liking_texture', 'melt_rate',
# 'melting_rate', 'milky', 'sugar', 'sugar_level', 'sweetness',
# 'tackiness', 'vanilla']
# melted = ice_orig.melt(id_vars='overall', value_vars=cols, var_name='type')
# sns.lmplot(x='value', y='overall', col='type', col_wrap=5, data=melted,
# sharex=False, fit_reg=False)
Suppose we are trying to create new, popular ice cream flavors. We are interested in the following regression problem: given the sweetness of an ice cream flavor, predict its overall taste rating out of 7.
ice = pd.read_csv('icecream.csv')
ice
Although we expect that an ice cream flavor that is not sweet enough would receive a low rating, we also expect that an ice flavor that is too sweet would also receive a low rating. This is reflected in the scatter plot of overall rating and sweetness:
# HIDDEN
sns.lmplot(x='sweetness', y='overall', data=ice, fit_reg=False)
plt.title('Overall taste rating vs. sweetness');
Unfortunately, a linear model alone cannot take this increase-then-decrease behavior into account; in a linear model, the overall rating can only increase or decrease monotonically with the sweetness. We can see that using linear regression results in a poor fit.
# HIDDEN
sns.lmplot(x='sweetness', y='overall', data=ice)
plt.title('Overall taste rating vs. sweetness');
One useful approach for this problem is to fit a polynomial curve instead of a line. Such a curve would allow us to model the fact that the overall rating increases with sweetness only up to a certain point, then decreases as sweetness increases.
With a feature engineering technique, we can simply add new columns to our data to use our linear model for polynomial regression.
Recall that in linear regression we fit one weight for each column of our data matrix $X$. In this case, our matrix $X$ contains two columns: a column of all ones and the sweetness.
# HIDDEN
from sklearn.preprocessing import PolynomialFeatures
first_X = PolynomialFeatures(degree=1).fit_transform(ice[['sweetness']])
pd.DataFrame(data=first_X, columns=['bias', 'sweetness'])
Our model is thus:
$$ f_\hat{\theta} (x) = \hat{\theta_0} + \hat{\theta_1} \cdot \text{sweetness} $$
We can create a new column in $X$ containing the squared values of the sweetness.
# HIDDEN
second_X = PolynomialFeatures(degree=2).fit_transform(ice[['sweetness']])
pd.DataFrame(data=second_X, columns=['bias', 'sweetness', 'sweetness^2'])
Since our model learns one weight for each column of its input matrix, our model will become:
$$ f_\hat{\theta} (x) = \hat{\theta_0}
+ \hat{\theta_1} \cdot \text{sweetness}
+ \hat{\theta_2} \cdot \text{sweetness}^2
$$
Our model now fits a polynomial with degree two to our data. We can easily fit higher degree polynomials by adding columns for $\text{sweetness}^3$, $\text{sweetness}^4$, and so on.
Notice that this model is still a linear model because it is linear in its parameters—each $\hat{\theta_i}$ is a scalar value of degree one. However, the model is polynomial in its features because its input data contains a column that is a polynomial transformation of another column.
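We can check this equivalence directly: building the squared column by hand (on made-up sweetness values) produces the same matrix as `PolynomialFeatures`, and `LinearRegression` fits it like any other linear model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up sweetness values and ratings
sweetness = np.array([4.0, 6.0, 8.0, 10.0, 12.0]).reshape(-1, 1)
overall = np.array([4.0, 5.5, 6.0, 5.5, 4.0])

# Adding the bias and squared columns by hand...
X_manual = np.hstack([np.ones_like(sweetness), sweetness, sweetness ** 2])

# ...matches what PolynomialFeatures(degree=2) produces
X_poly = PolynomialFeatures(degree=2).fit_transform(sweetness)
print(np.allclose(X_manual, X_poly))  # True

# The model is linear in its parameters, so LinearRegression fits the curve
clf = LinearRegression(fit_intercept=False).fit(X_manual, overall)
print(len(clf.coef_))  # 3: one weight per column
```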
To conduct polynomial regression, we use a linear model with polynomial features. Thus, we import the LinearRegression model and PolynomialFeatures transform from scikit-learn.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
Our original data matrix $X$ contains the following values. Remember that we include the column and row labels for reference purposes only; the actual matrix $X$ only contains the numerical data in the table below.
ice[['sweetness']]
We first use the PolynomialFeatures class to transform the data, adding polynomial features of degree 2.
transformer = PolynomialFeatures(degree=2)
X = transformer.fit_transform(ice[['sweetness']])
X
Now, we fit a linear model to this data matrix.
clf = LinearRegression(fit_intercept=False)
clf.fit(X, ice['overall'])
clf.coef_
The parameters above show that for this dataset, the best-fit model is:
We can now compare this model's predictions against the original data.
# HIDDEN
sns.lmplot(x='sweetness', y='overall', data=ice, fit_reg=False)
xs = np.linspace(3.5, 12.5, 1000).reshape(-1, 1)
ys = clf.predict(transformer.transform(xs))
plt.plot(xs, ys)
plt.title('Degree 2 polynomial fit');
This model looks like a much better fit than our linear model. We can also verify that the mean squared cost for the degree 2 polynomial fit is much lower than the cost for the linear fit.
# HIDDEN
y = ice['overall']
pred_linear = (
LinearRegression(fit_intercept=False).fit(first_X, y).predict(first_X)
)
pred_quad = clf.predict(X)
def mse_cost(pred, y): return np.mean((pred - y) ** 2)
print(f'MSE cost for linear reg: {mse_cost(pred_linear, y):.3f}')
print(f'MSE cost for deg 2 poly reg: {mse_cost(pred_quad, y):.3f}')
As mentioned earlier, we are free to add higher degree polynomial features to our data. For example, we can easily create polynomial features of degree 5:
# HIDDEN
second_X = PolynomialFeatures(degree=5).fit_transform(ice[['sweetness']])
pd.DataFrame(data=second_X,
columns=['bias', 'sweetness', 'sweetness^2', 'sweetness^3',
'sweetness^4', 'sweetness^5'])
Fitting a linear model using these features results in a degree five polynomial regression.
# HIDDEN
trans_five = PolynomialFeatures(degree=5)
X_five = trans_five.fit_transform(ice[['sweetness']])
clf_five = LinearRegression(fit_intercept=False).fit(X_five, y)
sns.lmplot(x='sweetness', y='overall', data=ice, fit_reg=False)
xs = np.linspace(3.5, 12.5, 1000).reshape(-1, 1)
ys = clf_five.predict(trans_five.transform(xs))
plt.plot(xs, ys)
plt.title('Degree 5 polynomial fit');
The plot shows that a degree five polynomial seems to fit the data roughly as well as a degree two polynomial. In fact, the mean squared cost for the degree five polynomial is almost half of the cost for the degree two polynomial.
pred_five = clf_five.predict(X_five)
print(f'MSE cost for linear reg: {mse_cost(pred_linear, y):.3f}')
print(f'MSE cost for deg 2 poly reg: {mse_cost(pred_quad, y):.3f}')
print(f'MSE cost for deg 5 poly reg: {mse_cost(pred_five, y):.3f}')
This suggests that we might do even better by increasing the degree even more. Why not a degree 10 polynomial?
# HIDDEN
trans_ten = PolynomialFeatures(degree=10)
X_ten = trans_ten.fit_transform(ice[['sweetness']])
clf_ten = LinearRegression(fit_intercept=False).fit(X_ten, y)
sns.lmplot(x='sweetness', y='overall', data=ice, fit_reg=False)
xs = np.linspace(3.5, 12.5, 1000).reshape(-1, 1)
ys = clf_ten.predict(trans_ten.transform(xs))
plt.plot(xs, ys)
plt.title('Degree 10 polynomial fit')
plt.ylim(3, 7);
Here are the mean squared costs for the regression models we've seen thus far:
# HIDDEN
pred_ten = clf_ten.predict(X_ten)
print(f'MSE cost for linear reg: {mse_cost(pred_linear, y):.3f}')
print(f'MSE cost for deg 2 poly reg: {mse_cost(pred_quad, y):.3f}')
print(f'MSE cost for deg 5 poly reg: {mse_cost(pred_five, y):.3f}')
print(f'MSE cost for deg 10 poly reg: {mse_cost(pred_ten, y):.3f}')
The degree 10 polynomial has a cost of zero! This makes sense if we take a closer look at the plot; the degree ten polynomial manages to pass through the precise location of each point in the data.
However, you should feel hesitant to use the degree 10 polynomial to predict ice cream ratings. Intuitively, the degree 10 polynomial seems to fit our specific set of data too closely. If we take another set of data and plot them on the scatter plot above, we can expect that they fall close to our original set of data. When we do this, however, the degree 10 polynomial suddenly seems like a poor fit while the degree 2 polynomial still looks reasonable.
# HIDDEN
# sns.lmplot(x='sweetness', y='overall', data=ice, fit_reg=False)
np.random.seed(1)
x_devs = np.random.normal(scale=0.4, size=len(ice))
y_devs = np.random.normal(scale=0.4, size=len(ice))
plt.figure(figsize=(10, 5))
# Degree 10
plt.subplot(121)
ys = clf_ten.predict(trans_ten.transform(xs))
plt.plot(xs, ys)
plt.scatter(ice['sweetness'] + x_devs,
ice['overall'] + y_devs,
c='g')
plt.title('Degree 10 poly, second set of data')
plt.ylim(3, 7);
plt.subplot(122)
ys = clf.predict(transformer.transform(xs))
plt.plot(xs, ys)
plt.scatter(ice['sweetness'] + x_devs,
ice['overall'] + y_devs,
c='g')
plt.title('Degree 2 poly, second set of data')
plt.ylim(3, 7);
We can see that in this case, degree two polynomial features work better than both no transformation and degree ten polynomial features.
This raises the natural question: in general, how do we determine which degree polynomial to fit? Although we are tempted to use the cost on the training dataset to pick the best polynomial, we have seen that using this cost can pick a model that is too complex. Instead, we want to evaluate our model on data that is not used to fit parameters.
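A sketch of this idea on synthetic data: hold out half the points as a validation set, fit polynomials of different degrees on the rest, and compare errors. Because higher-degree feature sets nest the lower-degree ones, adding features can only lower the training error; only the held-out error can tell us when a model is too complex.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a quadratic trend plus noise
# (x is scaled to [0, 1] for numerical stability at high degrees)
np.random.seed(42)
x = np.random.uniform(0, 1, 24).reshape(-1, 1)
y = -3 * (x.ravel() - 0.5) ** 2 + 6 + np.random.normal(scale=0.3, size=24)

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.5, random_state=0)

def errors(degree):
    trans = PolynomialFeatures(degree=degree)
    clf = LinearRegression(fit_intercept=False).fit(trans.fit_transform(x_tr), y_tr)
    mse = lambda X_, y_: np.mean((clf.predict(trans.transform(X_)) - y_) ** 2)
    return mse(x_tr, y_tr), mse(x_val, y_val)

train2, val2 = errors(2)
train10, val10 = errors(10)

# Adding features can only lower the training error...
print(train10 <= train2 + 1e-9)  # True
# ...but the validation error tells the real story
print('degree 2 val MSE: ', val2)
print('degree 10 val MSE:', val10)
```

On draws like this one, the degree 10 model typically shows a much larger validation error despite its lower training error, which is exactly the overfitting behavior seen with the ice cream data.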
In this section, we introduce another feature engineering technique: adding polynomial features to the data in order to perform polynomial regression. Like one-hot encoding, adding polynomial features allows us to use our linear regression model effectively on more types of data.
We have also encountered a fundamental issue with feature engineering. Adding many features to the data gives the model a lower cost on its original set of data but often results in a less accurate model on new sets of data.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/15'))
Sometimes, we choose a model that is too simple to represent the underlying data generation process. Other times, we choose a model that is too complex—it fits the noise in the data rather than the data's overall pattern.
To understand why this happens, we analyze our models using the tools of probability and statistics. These tools allow us to generalize beyond a few isolated examples to describe fundamental phenomena in modeling. In particular, we will use the tools of expectation and variance to uncover the bias-variance tradeoff.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/15'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
In order to make predictions using data, we define a model, select a loss function across the entire dataset, and fit the model's parameters by minimizing the loss. For example, to conduct least squares linear regression, we select the model:
$$ f_\hat{\theta} (x) = \hat{\theta} \cdot x $$
And the loss function:
$$ L(\hat{\theta}, X, y) = \frac{1}{n} \sum_{i} (y_i - f_\hat{\theta} (X_i))^2 $$
As before, we use $\hat{\theta}$ as our vector of model parameters, $x$ as a vector containing one row of our data matrix $X$, and $y$ as our vector of observed values to predict. $X_i$ is the $i$'th row of $X$ and $y_i$ is the $i$'th entry of $y$.
Observe that our loss function across the dataset is the average of the loss function values for each row of our data. If we define the squared loss function:
$$ \ell(y_i, f_\hat{\theta} (X_i)) = (y_i - f_\hat{\theta} (X_i))^2 $$
Then we may rewrite our average loss function more simply:
$$ L(\hat{\theta}, X, y) = \frac{1}{n} \sum_{i} \ell(y_i, f_\hat{\theta} (X_i)) $$
The expression above abstracts over the specific loss function; regardless of the loss function we choose, our overall loss is the average loss.
By minimizing the average loss, we select the model parameters that best fit our observed dataset. Thus far, we have refrained from making statements about the population that generated the dataset. In reality, however, we are quite interested in making good predictions on the entire population, not just our data that we have already seen.
If our observed dataset $X$ and $y$ is drawn at random from a given population, our observed data are random variables. If our observed data are random variables, our model parameters are also random variables—each time we collect a new set of data and fit a model, the parameters of the model will be slightly different.
Suppose we draw one more input-output pair $ z $, $ \gamma $ from our population at random. The loss that our model produces on this value is:

$$ \ell(\gamma, f_\hat{\theta} (z)) $$

Notice that this loss is a random variable; the loss changes for different sets of observed data $ X $ and $ y $ and different points $ z $, $ \gamma $ from our population.

The risk for a model $ f_\hat{\theta} $ is the expected value of the loss above for all training data $ X $, $ y $ and all points $ z $, $ \gamma $ in the population:

$$ R(f_\hat{\theta}) = \mathbb{E}[ \ell(\gamma, f_\hat{\theta} (z)) ] $$
Notice that the risk is an expectation of a random variable and is thus not random itself. The expected value of a fair six-sided die roll is 3.5 even though the rolls themselves are random.
The risk above is sometimes called the true risk because it tells us how a model does on the entire population. If we could compute the true risk for all models, we could simply pick the model with the least risk and know with certainty that the model will perform better in the long run than all other models on our choice of loss function.
Reality, however, is not so kind. If we substitute in the definition of expectation into the formula for the true risk, we get:

$$ R(f_\hat{\theta}) = \mathbb{E}[ \ell(\gamma, f_\hat{\theta} (z)) ] = \sum_{(z, \gamma)} \ell(\gamma, f_\hat{\theta} (z)) \, P(z, \gamma) $$

To further simplify this expression, we need to know $ P(z, \gamma) $, the global probability distribution of observing any point in the population. Unfortunately, this is not so easy. Suppose we are trying to predict the tip amount based on the size of the table. What is the probability that a table of three people gives a tip of $14.50? If we knew the distribution of points exactly, we wouldn't have to collect data or fit a model—we would already know the most likely tip amount for any given table.

Although we do not know the exact distribution of the population, we can approximate it using the observed dataset $ X $ and $ y $. If $ X $ and $ y $ are drawn at random from our population, the distribution of points in $ X $ and $ y $ is similar to the population distribution. Thus, we treat $ X $ and $ y $ as our population. Then, the probability that any input-output pair $ X_i $, $ y_i $ appears is $ \frac{1}{n} $ since each pair appears once out of $ n $ points total.
This allows us to calculate the empirical risk, an approximation for the true risk:

$$ \hat{R}(f_\hat{\theta}) = \frac{1}{n} \sum_{i} \ell(y_i, f_\hat{\theta} (X_i)) $$

If our dataset is large and the data are drawn at random from the population, the empirical risk $ \hat{R}(f_\hat{\theta}) $ is close to the true risk $ R(f_\hat{\theta}) $. This allows us to pick the model that minimizes the empirical risk.
Notice that this expression is the average loss function at the start of the section! By minimizing the average loss, we also minimize the empirical risk. This explains why we often use the average loss as our overall loss function instead of the maximum loss, for example.
The true risk of a prediction model describes the overall long-run loss that the model will produce for the population. Since we typically cannot calculate the true risk directly, we calculate the empirical risk instead and use the empirical risk to find an appropriate model for prediction. Because the empirical risk is the average loss on the observed dataset, we often minimize the average loss when fitting models.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/15'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + nrows, col:col + ncols]

    if len(df.columns) <= ncols:
        interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
    else:
        interact(peek,
                 row=(0, len(df) - nrows, nrows),
                 col=(0, len(df.columns) - ncols))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
We have previously seen that our choice of model has two basic sources of error.
Our model may be too simple—a linear model is not able to properly fit data generated from a quadratic process, for example. This type of error arises from model bias.
Our model may also fit the random noise present in the data—even if we fit a quadratic process using a quadratic model, the model may predict different outcomes than the true process produces. This type of error arises from model variance.
We can make the statements above more precise by decomposing our formula for model risk. Recall that the risk for a model $ f_\hat{\theta} $ is the expected loss for all possible sets of training data $ X $, $ y $ and all input-output points $ z $, $ \gamma $ in the population:

$$ R(f_\hat{\theta}) = \mathbb{E}[ \ell(\gamma, f_\hat{\theta} (z)) ] $$

We denote the process that generates the true population data as $ f $. The output point $ \gamma $ is generated by our population process plus some random noise in data collection: $ \gamma = f(z) + \epsilon $. The random noise $ \epsilon $ is a random variable with a mean of zero: $ \mathbb{E}[\epsilon] = 0 $.

If we use the squared error as our loss function, the above expression becomes:

$$ R(f_\hat{\theta}) = \mathbb{E}[ (\gamma - f_\hat{\theta} (z))^2 ] $$

With some algebraic manipulation, we can show that the above expression is equivalent to:

$$ R(f_\hat{\theta}) = (\mathbb{E}[f_\hat{\theta}(z)] - f(z))^2 + \mathrm{Var}(f_\hat{\theta}(z)) + \mathrm{Var}(\epsilon) $$

The first term in this expression, $ (\mathbb{E}[f_\hat{\theta}(z)] - f(z))^2 $, is a mathematical expression for the bias of the model. (Technically, this term represents the bias squared, $ \text{bias}^2 $.) The bias is equal to zero if in the long run our choice of model $ f_\hat{\theta} $ predicts the same outcomes produced by the population process $ f $. The bias is high if our choice of model makes poor predictions of the population process even when we have the entire population as our dataset.

The second term in this expression, $ \mathrm{Var}(f_\hat{\theta}(z)) $, represents the model variance. The variance is low when the model's predictions don't change much when the model is trained on different datasets from the population. The variance is high when the model's predictions change greatly when the model is trained on different datasets from the population.

The third and final term in this expression, $ \mathrm{Var}(\epsilon) $, represents the irreducible error or the noise in the data generation and collection process. This term is small when the data generation and collection process is precise or has low variation. This term is large when the data contain large amounts of noise.
To begin the decomposition, we start with the mean squared error:

$$ \mathbb{E}[(\gamma - f_\hat{\theta}(z))^2] $$

And expand the square and apply linearity of expectation:

$$ = \mathbb{E}[\gamma^2] - 2 \mathbb{E}[\gamma f_\hat{\theta}(z)] + \mathbb{E}[f_\hat{\theta}(z)^2] $$

Because $ \gamma $ and $ f_\hat{\theta}(z) $ are independent (the model outputs and population observations don't depend on each other), we can say that $ \mathbb{E}[\gamma f_\hat{\theta}(z)] = \mathbb{E}[\gamma] \mathbb{E}[f_\hat{\theta}(z)] $. We then substitute $ f(z) + \epsilon $ for $ \gamma $:

$$ = \mathbb{E}[(f(z) + \epsilon)^2] - 2 \mathbb{E}[f(z) + \epsilon] \mathbb{E}[f_\hat{\theta}(z)] + \mathbb{E}[f_\hat{\theta}(z)^2] $$

Simplifying some more: (Note that $ \mathbb{E}[f(z)] = f(z) $ because $ f $ is a deterministic function, given a particular query point $ z $.)

$$ = \mathbb{E}[f(z)^2 + 2 f(z) \epsilon + \epsilon^2] - 2 (f(z) + \mathbb{E}[\epsilon]) \mathbb{E}[f_\hat{\theta}(z)] + \mathbb{E}[f_\hat{\theta}(z)^2] $$

Applying linearity of expectation again:

$$ = f(z)^2 + 2 f(z) \mathbb{E}[\epsilon] + \mathbb{E}[\epsilon^2] - 2 (f(z) + \mathbb{E}[\epsilon]) \mathbb{E}[f_\hat{\theta}(z)] + \mathbb{E}[f_\hat{\theta}(z)^2] $$

Noting that $ \mathbb{E}[\epsilon] = 0 $ because $ \epsilon $ has zero mean:

$$ = f(z)^2 + \mathbb{E}[\epsilon^2] - 2 f(z) \mathbb{E}[f_\hat{\theta}(z)] + \mathbb{E}[f_\hat{\theta}(z)^2] $$

We can then rewrite the equation as:

$$ = \left( f(z)^2 - 2 f(z) \mathbb{E}[f_\hat{\theta}(z)] + \mathbb{E}[f_\hat{\theta}(z)]^2 \right) + \left( \mathbb{E}[f_\hat{\theta}(z)^2] - \mathbb{E}[f_\hat{\theta}(z)]^2 \right) + \mathbb{E}[\epsilon^2] = (f(z) - \mathbb{E}[f_\hat{\theta}(z)])^2 + \mathrm{Var}(f_\hat{\theta}(z)) + \mathbb{E}[\epsilon^2] $$

Because $ \mathbb{E}[\epsilon] = 0 $, $ \mathbb{E}[\epsilon^2] = \mathrm{Var}(\epsilon) $:

$$ = (f(z) - \mathbb{E}[f_\hat{\theta}(z)])^2 + \mathrm{Var}(f_\hat{\theta}(z)) + \mathrm{Var}(\epsilon) $$
To pick a model that performs well, we seek to minimize the risk. To minimize the risk, we attempt to minimize the bias, variance, and noise terms of the bias-variance decomposition. Decreasing the noise term typically requires improvements to the data collection process—purchasing more precise sensors, for example. To decrease bias and variance, however, we must tune the complexity of our models. Models that are too simple have high bias; models that are too complex have high variance. This is the essence of the bias-variance tradeoff, a fundamental issue that we face in choosing models for prediction.
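The tradeoff can be made concrete with a small simulation. The sine process, sample sizes, and query point below are illustrative (not the book's dataset): we refit a line on many random datasets and estimate the squared bias and the variance of its prediction at one point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical population process (illustrative only)
    return np.sin(x)

def draw_dataset(n=30):
    x = rng.uniform(0, 3, n)
    return x, f(x) + rng.normal(scale=0.1, size=n)

# Refit a line (degree-1 polynomial) on many random datasets and record
# each fitted model's prediction at a fixed query point z.
z = 1.5
preds = []
for _ in range(2000):
    x, y = draw_dataset()
    coeffs = np.polyfit(x, y, deg=1)
    preds.append(np.polyval(coeffs, z))
preds = np.array(preds)

bias_sq = (preds.mean() - f(z)) ** 2  # squared bias of the line at z
variance = preds.var()                # model variance at z

# The line underfits the sine process, so the bias term dominates.
print(bias_sq, variance)
```

Because a line is too simple for a sine-shaped process, the bias term dominates here; a more complex model would shrink the bias at the cost of a larger variance term.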
Suppose we are modeling data generated from the oscillating function shown below.
# HIDDEN
from collections import namedtuple
from sklearn.linear_model import LinearRegression
np.random.seed(42)
Line = namedtuple('Line', ['x_start', 'x_end', 'y_start', 'y_end'])
def f(x): return np.sin(x) + 0.3 * x
def noise(n):
    return np.random.normal(scale=0.1, size=n)

def draw(n):
    points = np.random.choice(np.arange(0, 20, 0.2), size=n)
    return points, f(points) + noise(n)

def fit_line(x, y, x_start=0, x_end=20):
    clf = LinearRegression().fit(x.reshape(-1, 1), y)
    # predict expects a 2D array, so wrap the endpoints
    return Line(x_start, x_end,
                clf.predict([[x_start]])[0], clf.predict([[x_end]])[0])
population_x = np.arange(0, 20, 0.2)
population_y = f(population_x)
avg_line = fit_line(population_x, population_y)
datasets = [draw(100) for _ in range(20)]
random_lines = [fit_line(x, y) for x, y in datasets]
# HIDDEN
plt.plot(population_x, population_y)
plt.title('True underlying data generation process');
If we randomly draw a dataset from the population, we may end up with the following:
# HIDDEN
xs, ys = draw(100)
plt.scatter(xs, ys, s=10)
plt.title('One set of observed data');
Suppose we draw many sets of data from the population and fit a simple linear model to each one. Below, we plot the population data generation scheme in blue and the model predictions in green.
# HIDDEN
plt.figure(figsize=(8, 5))
plt.plot(population_x, population_y)
for x_start, x_end, y_start, y_end in random_lines:
    plt.plot([x_start, x_end], [y_start, y_end], linewidth=1, c='g')
plt.title('Population vs. linear model predictions');
The plot above clearly shows that a linear model will make prediction errors for this population. We may decompose the prediction errors into bias, variance, and irreducible noise. We illustrate bias of our model by showing that the long-run average linear model will predict different outcomes than the population process:
plt.figure(figsize=(8, 5))
xs = np.arange(0, 20, 0.2)
plt.plot(population_x, population_y, label='Population')
plt.plot([avg_line.x_start, avg_line.x_end],
[avg_line.y_start, avg_line.y_end],
linewidth=2, c='r',
label='Long-run average linear model')
plt.title('Bias of linear model')
plt.legend();
The variance of our model is the variation of the model predictions around the long-run average model:
plt.figure(figsize=(8, 5))
for x_start, x_end, y_start, y_end in random_lines:
    plt.plot([x_start, x_end], [y_start, y_end], linewidth=1, c='g', alpha=0.8)
plt.plot([avg_line.x_start, avg_line.x_end],
[avg_line.y_start, avg_line.y_end],
linewidth=4, c='r')
plt.title('Variance of linear model');
Finally, we illustrate the irreducible error by showing the deviations of the observed points from the underlying population process.
# HIDDEN
plt.plot(population_x, population_y)
xs, ys = draw(100)
plt.scatter(xs, ys, s=10)
plt.title('Irreducible error');
In an ideal world, we would minimize the expected prediction error for our model over all input-output points in the population. However, in practice, we do not know the population data generation process and thus are unable to precisely determine a model's bias, variance, or irreducible error. Instead, we use our observed dataset as an approximation to the population.
As we have seen, however, achieving a low training error does not necessarily mean that our model will have a low test error as well. It is easy to obtain a model with extremely low bias and therefore low training error by fitting a curve that passes through every training observation. However, this model will have high variance which typically leads to high test error. Conversely, a model that predicts a constant has low variance but high bias. Fundamentally, this occurs because training error reflects the bias of our model but not the variance; the test error reflects both. In order to minimize test error, our model needs to simultaneously achieve low bias and low variance. To account for this, we need a way to simulate test error without using the test set. This is generally done using cross validation.
The bias-variance tradeoff allows us to more precisely describe the modeling phenomena that we have seen thus far.
Underfitting is typically caused by too much bias; overfitting is typically caused by too much model variance.
Collecting more data reduces variance. For example, the model variance of linear regression goes down by a factor of $ \frac{1}{n} $, where $ n $ is the number of data points. Thus, doubling the dataset size halves the model variance, and collecting many data points will cause the variance to approach 0. One recent trend is to select a model with low bias and high intrinsic variance (e.g. a neural network) and collect many data points so that the model variance is low enough to make accurate predictions. While effective in practice, collecting enough data for these models tends to require large amounts of time and money.
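The $ \frac{1}{n} $ scaling can be checked with a quick simulation (toy data; the ratio is approximate because of sampling noise in the variance estimates):

```python
import numpy as np

rng = np.random.default_rng(42)

def slope_variance(n, trials=3000):
    # Variance of a fitted line's slope across many datasets of size n.
    slopes = []
    for _ in range(trials):
        x = rng.uniform(0, 10, n)
        y = 2 * x + rng.normal(scale=1.0, size=n)
        slope = np.polyfit(x, y, deg=1)[0]
        slopes.append(slope)
    return np.var(slopes)

v100 = slope_variance(100)
v200 = slope_variance(200)

# Doubling n roughly halves the variance of the fitted parameter.
print(v200 / v100)
```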
Collecting more data reduces bias if the model can fit the population process exactly. If the model is inherently incapable of modeling the population (as in the example above), even infinite data cannot get rid of model bias.
Adding a useful feature to the data, such as a quadratic feature when the underlying process is quadratic, reduces bias. Adding a useless feature rarely increases bias.
Adding a feature, whether useful or not, typically increases model variance since each new feature adds a parameter to the model. Generally speaking, models with many parameters have many possible combinations of parameters and therefore have higher variance than models with few parameters. In order to increase a model's prediction accuracy, a new feature should decrease bias more than it increases variance.
Removing features will typically increase bias and can cause underfitting. For example, a simple linear model has higher model bias than the same model with a quadratic feature added to it. If the data were generated from a quadratic phenomenon, the simple linear model underfits the data.
In the plot below, the X-axis measures model complexity and the Y-axis measures magnitude. Notice how as model complexity increases, model bias strictly decreases and model variance strictly increases. As we choose more complex models, the test error first decreases then increases as the increased model variance outweighs the decreased model bias.

As the plot shows, a model with high complexity can achieve low training error but can fail to generalize to the test set because of its high model variance. On the other hand, a model with low complexity will have low model variance but can also fail to generalize because of its high model bias. To select a useful model, we must strike a balance between model bias and variance.
As we add more data, we shift the curves on our plot to the right and down, reducing bias and variance:

The bias-variance tradeoff reveals a fundamental problem in modeling. In order to minimize model risk, we use a combination of feature engineering, model selection, and cross-validation to balance bias and variance.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/15'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + nrows, col:col + ncols]

    if len(df.columns) <= ncols:
        interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
    else:
        interact(peek,
                 row=(0, len(df) - nrows, nrows),
                 col=(0, len(df.columns) - ncols))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
In the previous section, we observed that we need a more accurate way of simulating the test error to manage the bias-variance tradeoff. To reiterate, the training error is misleadingly low because we fit our model on the training set. We need to choose a model without using the test set, so we split the training set again to carve out a validation set. Cross-validation provides a method of estimating our model error using a single observed dataset by separating the data used for training from the data used for model selection and final accuracy.
One way to accomplish this is to split the original dataset into three disjoint subsets:

- Training set: the data used to fit the model.
- Validation set: the data used to select between models.
- Test set: the data used to report the final model's accuracy.

After splitting, we select a set of features and a model based on the following procedure:

1. For each potential set of features, fit a model using the training set. The model's error on the training set is its training error.
2. Compute each model's error on the validation set: its validation error. Select the model that achieves the lowest validation error. This is the final choice of features and model.
3. Compute the final model's error on the test set: its test error. This is the final reported accuracy of the model. To prevent adapting the model to the test set, we are forbidden from adjusting the features or model afterward.
This process allows us to more accurately determine the model to use than using the training error alone. By using cross-validation, we can test our model on data that it wasn't fit on, simulating test error without using the test set. This gives us a sense of how our model performs on unseen data.
Size of the train-validation-test split
The train-validation-test split commonly uses 70% of the data as the training set, 15% as the validation set, and the remaining 15% as the test set. Increasing the size of the training set helps model accuracy but causes more variation in the validation and test error. This is because a smaller validation set and test set are less representative of the sample data.
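A 70/15/15 split can be produced with two calls to scikit-learn's train_test_split; the arrays of 100 toy points below are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # toy features
y = np.arange(100)                 # toy outcomes

# First carve off 15 points for the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=15, random_state=0)

# ...then carve 15 more out of the remainder for the validation set,
# leaving 70 points for training.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=15, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 70 15 15
```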
A model is of little use to us if it fails to generalize to unseen data from the population. The test error provides the most accurate representation of the model's performance on new data since we do not use the test set to train the model or select features.
In general, the training error decreases as we add complexity to our model with additional features or more complex prediction mechanisms. The test error, on the other hand, decreases up to a certain amount of complexity and then increases again as the model overfits the training set. This is due to the fact that at first, bias decreases more than variance increases. Eventually, the increase in variance surpasses the decrease in bias.

The train-validation-test split is a good way to simulate test error through the validation set. However, making the three splits leaves too little data for training. Also, with this method the validation error may be prone to high variance because the evaluation of the error may depend heavily on which points end up in the training and validation sets.
To tackle this problem, we can run the train-validation split multiple times on the same dataset. The dataset is divided into $ k $ equally sized subsets ($ k $ folds), and the train-validation split is repeated $ k $ times. Each time, one of the $ k $ folds is used as the validation set, and the remaining $ k-1 $ folds are used as the training set. We report the model's final validation error as the average of the $ k $ validation errors from each trial. This method is called $ k $-fold cross-validation.
The diagram below illustrates the technique when using five folds:

The biggest advantage of this method is that every data point is used for validation exactly once and for training $ k-1 $ times. Typically, a $ k $ between 5 and 10 is used, but $ k $ remains an unfixed parameter. When $ k $ is small, the error estimate has a lower variance (many validation points) but a higher bias (fewer training points). Vice versa, when $ k $ is large, the error estimate has a lower bias but a higher variance.

$ k $-fold cross-validation takes more computation time than the train-validation split since we typically have to refit each model from scratch for each fold. However, it computes a more accurate validation error by averaging multiple errors together for each model.

The scikit-learn library provides a convenient sklearn.model_selection.KFold class to implement $ k $-fold cross-validation.
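For example, KFold.split yields the train and validation indices for each fold (the 20-point array below is a toy example):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # toy data

kf = KFold(n_splits=5)
fold_sizes = []
for train_idx, valid_idx in kf.split(X):
    # Each fold holds out a different fifth of the data for validation.
    fold_sizes.append((len(train_idx), len(valid_idx)))

print(fold_sizes)  # [(16, 4), (16, 4), (16, 4), (16, 4), (16, 4)]
```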
Cross-validation helps us manage the bias-variance tradeoff more accurately. Intuitively, the validation error estimates test error by checking the model's performance on a dataset not used for training; this allows us to estimate both model bias and model variance. K-fold cross-validation also incorporates the fact that the noise in the test set only affects the noise term in the bias-variance decomposition whereas the noise in the training set affects both bias and model variance. To choose the final model to use, we select the one that has the lowest validation error.
We will use the complete model selection process, including cross-validation, to select a model that predicts ice cream ratings from ice cream sweetness. The complete ice cream dataset and a scatter plot of the overall rating versus ice cream sweetness are shown below.
# HIDDEN
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

ice = pd.read_csv('icecream.csv')
transformer = PolynomialFeatures(degree=2)
X = transformer.fit_transform(ice[['sweetness']])
clf = LinearRegression(fit_intercept=False).fit(X, ice[['overall']])
xs = np.linspace(3.5, 12.5, 300).reshape(-1, 1)
rating_pred = clf.predict(transformer.transform(xs))
temp = pd.DataFrame(xs, columns = ['sweetness'])
temp['overall'] = rating_pred
np.random.seed(42)
x_devs = np.random.normal(scale=0.2, size=len(temp))
y_devs = np.random.normal(scale=0.2, size=len(temp))
temp['sweetness'] = np.round(temp['sweetness'] + x_devs, decimals=2)
temp['overall'] = np.round(temp['overall'] + y_devs, decimals=2)
ice = pd.concat([temp, ice])
ice
# HIDDEN
plt.scatter(ice['sweetness'], ice['overall'], s=10)
plt.title('Ice Cream Rating vs. Sweetness')
plt.xlabel('Sweetness')
plt.ylabel('Rating');
Using degree 10 polynomial features on 9 random points from the dataset results in a perfectly accurate model for those data points. Unfortunately, this model fails to generalize to previously unseen data from the population.
# HIDDEN
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
ice2 = pd.read_csv('icecream.csv')
trans_ten = PolynomialFeatures(degree=10)
X_ten = trans_ten.fit_transform(ice2[['sweetness']])
y = ice2['overall']
clf_ten = LinearRegression(fit_intercept=False).fit(X_ten, y)
# HIDDEN
np.random.seed(1)
x_devs = np.random.normal(scale=0.4, size=len(ice2))
y_devs = np.random.normal(scale=0.4, size=len(ice2))
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(ice2['sweetness'], ice2['overall'])
xs = np.linspace(3.5, 12.5, 1000).reshape(-1, 1)
ys = clf_ten.predict(trans_ten.transform(xs))
plt.plot(xs, ys)
plt.title('Degree 10 polynomial fit')
plt.ylim(3, 7);
plt.subplot(122)
ys = clf_ten.predict(trans_ten.transform(xs))
plt.plot(xs, ys)
plt.scatter(ice2['sweetness'] + x_devs,
ice2['overall'] + y_devs,
c='g')
plt.title('Degree 10 poly, second set of data')
plt.ylim(3, 7);
Instead of the above method, we first partition our data into training, validation, and test datasets using scikit-learn's sklearn.model_selection.train_test_split method to perform a 70/30% train-test split.
from sklearn.model_selection import train_test_split
test_size = 92
X_train, X_test, y_train, y_test = train_test_split(
ice[['sweetness']], ice['overall'], test_size=test_size, random_state=0)
print(f' Training set size: {len(X_train)}')
print(f' Test set size: {len(X_test)}')
We now fit polynomial regression models using the training set, one for each polynomial degree from 1 to 10.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# First, we add polynomial features to X_train
transformers = [PolynomialFeatures(degree=deg)
for deg in range(1, 11)]
X_train_polys = [transformer.fit_transform(X_train)
for transformer in transformers]
# Display the X_train with degree 5 polynomial features
X_train_polys[4]
We will then perform 5-fold cross-validation on the 10 featurized datasets. To do so, we will define a function that:
1. Uses the KFold.split function to get 5 splits on the training data. Note that split returns the indices of the data for that split.
2. For each split, fits the model on the training split and computes the mean squared error (MSE) on the validation split.
3. Averages the validation errors over the splits and returns the result.

from sklearn.model_selection import KFold
def mse_cost(y_pred, y_actual):
    return np.mean((y_pred - y_actual) ** 2)

def compute_CV_error(model, X_train, Y_train):
    kf = KFold(n_splits=5)
    validation_errors = []

    for train_idx, valid_idx in kf.split(X_train):
        # Split the data
        split_X_train, split_X_valid = X_train[train_idx], X_train[valid_idx]
        split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]

        # Fit the model on the training split
        model.fit(split_X_train, split_Y_train)

        # Compute the MSE on the validation split
        error = mse_cost(model.predict(split_X_valid), split_Y_valid)

        validation_errors.append(error)

    # Average the validation errors
    return np.mean(validation_errors)
# We train a linear regression model for each featurized dataset and perform cross-validation
# We set fit_intercept=False for our linear regression model since
# the PolynomialFeatures transformer adds the bias column for us.
cross_validation_errors = [compute_CV_error(LinearRegression(fit_intercept=False), X_train_poly, y_train)
for X_train_poly in X_train_polys]
# HIDDEN
cv_df = pd.DataFrame({'Validation Error': cross_validation_errors}, index=range(1, 11))
cv_df.index.name = 'Degree'
pd.options.display.max_rows = 20
display(cv_df)
pd.options.display.max_rows = 7
We can see that as we use higher degree polynomial features, the validation error decreases and then increases again.
# HIDDEN
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.plot(cv_df.index, cv_df['Validation Error'])
plt.scatter(cv_df.index, cv_df['Validation Error'])
plt.title('Validation Error vs. Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation Error');
plt.subplot(122)
plt.plot(cv_df.index, cv_df['Validation Error'])
plt.scatter(cv_df.index, cv_df['Validation Error'])
plt.ylim(0.044925, 0.05)
plt.title('Zoomed In')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation Error')
plt.tight_layout();
Examining the validation errors reveals that the most accurate model used only degree 2 polynomial features. Thus, we select the degree 2 polynomial model as our final model and fit it on all of the training data at once. Then, we compute its error on the test set.
best_trans = transformers[1]
best_model = LinearRegression(fit_intercept=False).fit(X_train_polys[1], y_train)
training_error = mse_cost(best_model.predict(X_train_polys[1]), y_train)
validation_error = cross_validation_errors[1]
test_error = mse_cost(best_model.predict(best_trans.transform(X_test)), y_test)
print('Degree 2 polynomial')
print(f' Training error: {training_error:0.5f}')
print(f'Validation error: {validation_error:0.5f}')
print(f' Test error: {test_error:0.5f}')
For future reference, scikit-learn has a cross_val_predict method to automatically perform cross-validation, so we don't have to break the data into training and validation sets ourselves.
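As an illustrative sketch, the related helper sklearn.model_selection.cross_val_score computes per-fold validation errors in one call (the data here is a toy linear process, not the ice cream dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=100)

# Five-fold cross-validated MSE. scikit-learn negates the MSE so that
# higher scores are always better, so we flip the sign back.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring='neg_mean_squared_error')
mse_per_fold = -scores
print(mse_per_fold.mean())
```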
Also, note that the test error is higher than the validation error which is higher than the training error. The training error should be the lowest because the model is fit on the training data. Fitting the model minimizes the mean squared error for that dataset. The validation error and the test error are usually higher than the training error because the error is computed on an unknown dataset that the model hasn't seen.
We use the widely applicable cross-validation technique to manage the bias-variance tradeoff. After computing a train-validation-test split on the original dataset, we use the following procedure to train and choose a model:

1. For each potential set of features, fit a model using the training set.
2. Compute each model's error on the validation set and select the model with the lowest validation error.
3. Report that model's error on the test set as its final accuracy.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/16'))
Feature engineering can incorporate important information about the data generation process into our model. However, adding features to the data also typically increases the variance of our model and can thus result in worse performance overall. Rather than throwing out features entirely, we can turn to a technique called regularization to reduce the variance of our model while still incorporating as much information about the data as possible.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/16'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + nrows, col:col + ncols]

    if len(df.columns) <= ncols:
        interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
    else:
        interact(peek,
                 row=(0, len(df) - nrows, nrows),
                 col=(0, len(df.columns) - ncols))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
df = pd.read_csv('water_large.csv')
# HIDDEN
from collections import namedtuple
Curve = namedtuple('Curve', ['xs', 'ys'])
def flatten(seq): return [item for subseq in seq for item in subseq]

def make_curve(clf, x_start=-50, x_end=50):
    xs = np.linspace(x_start, x_end, num=100)
    ys = clf.predict(xs.reshape(-1, 1))
    return Curve(xs, ys)

def plot_data(df=df, ax=plt, **kwargs):
    ax.scatter(df.iloc[:, 0], df.iloc[:, 1], s=50, **kwargs)

def plot_curve(curve, ax=plt, **kwargs):
    ax.plot(curve.xs, curve.ys, **kwargs)

def plot_curves(curves, cols=2):
    rows = int(np.ceil(len(curves) / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(10, 8),
                             sharex=True, sharey=True)
    for ax, curve, deg in zip(flatten(axes), curves, degrees):
        plot_data(ax=ax, label='Training data')
        plot_curve(curve, ax=ax, label=f'Deg {deg} poly')
        ax.set_ylim(-5e10, 170e10)
        ax.legend()

    # add a big axes, hide frame
    fig.add_subplot(111, frameon=False)
    # hide tick and tick label of the big axes
    plt.tick_params(labelcolor='none', top='off', bottom='off',
                    left='off', right='off')
    plt.grid(False)
    plt.title('Polynomial Regression')
    plt.xlabel('Water Level Change (m)')
    plt.ylabel('Water Flow (Liters)')
    plt.tight_layout()

def print_coef(clf):
    reg = clf.named_steps['reg']
    print(reg.intercept_)
    print(reg.coef_)
# HIDDEN
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = df.iloc[:, [0]].to_numpy()
y = df.iloc[:, 1].to_numpy()

degrees = [1, 2, 8, 12]
clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
                  ('reg', LinearRegression())])
        .fit(X, y)
        for deg in degrees]

curves = [make_curve(clf) for clf in clfs]

ridge_clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
                        ('reg', Ridge(alpha=0.1, normalize=True))])
              .fit(X, y)
              for deg in degrees]

ridge_curves = [make_curve(clf) for clf in ridge_clfs]
We begin our discussion of regularization with an example that illustrates the importance of regularization.
The following dataset records the amount of water that flows out of a large dam on a particular day in liters and the amount the water level changed on that day in meters.
# HIDDEN
df
Plotting this data shows an upward trend in water flow as the water level becomes more positive.
# HIDDEN
df.plot.scatter(0, 1, s=50);
To model this pattern, we may use a least squares linear regression model. We show the data and the model's predictions on the plot below.
# HIDDEN
df.plot.scatter(0, 1, s=50);
plot_curve(curves[0])
The visualization shows that this model does not capture the pattern in the data—the model has high bias. As we have previously done, we can attempt to address this issue by adding polynomial features to the data. We add polynomial features of degrees 2, 8, and 12; the chart below shows the training data with each model's predictions.
# HIDDEN
plot_curves(curves)
As expected, the degree 12 polynomial matches the training data well but also seems to fit spurious patterns in the data caused by noise. This provides yet another illustration of the bias-variance tradeoff: the linear model has high bias and low variance while the degree 12 polynomial has low bias but high variance.
Examining the coefficients of the degree 12 polynomial model reveals that this model makes predictions according to the following formula:
$$ 207097470825 + 1.8x + 482.6x^2 + 601.5x^3 + 872.8x^4 + 150486.6x^5 \\
+ 2156.7x^6 - 307.2x^7 - 4.6x^8 + 0.2x^9 + 0.003x^{10} - 0.00005x^{11} + 0x^{12}
$$
where $x$ is the water level change on that day.
The coefficients for the model are quite large, especially for the higher degree terms which contribute significantly to the model's variance (the $x^5$ and $x^6$ terms, for example).
Recall that our linear model makes predictions according to the following, where $\hat{\theta}$ is the vector of model weights and $x$ is the vector of features:

$$ f_{\hat{\theta}}(x) = \hat{\theta} \cdot x $$

To fit our model, we minimize the mean squared error cost function, where $X$ is used to represent the data matrix and $y$ the observed outcomes:

$$ L(\hat{\theta}, X, y) = \frac{1}{n} \sum_{i}(y_i - f_{\hat{\theta}}(X_i))^2 $$
To minimize the cost above, we adjust $\hat{\theta}$ until we find the best combination of weights, regardless of how large the weights themselves are. However, we have found that larger weights for more complex features result in high model variance. If we could instead alter the cost function to penalize large weight values, the resulting model would have lower variance. We use regularization to add this penalty.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/16'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
df = pd.read_csv('water_large.csv')
# HIDDEN
from collections import namedtuple
Curve = namedtuple('Curve', ['xs', 'ys'])
def flatten(seq): return [item for subseq in seq for item in subseq]
def make_curve(clf, x_start=-50, x_end=50):
xs = np.linspace(x_start, x_end, num=100)
ys = clf.predict(xs.reshape(-1, 1))
return Curve(xs, ys)
def plot_data(df=df, ax=plt, **kwargs):
ax.scatter(df.iloc[:, 0], df.iloc[:, 1], s=50, **kwargs)
def plot_curve(curve, ax=plt, **kwargs):
ax.plot(curve.xs, curve.ys, **kwargs)
def plot_curves(curves, cols=2, labels=None):
if labels is None:
labels = [f'Deg {deg} poly' for deg in degrees]
rows = int(np.ceil(len(curves) / cols))
fig, axes = plt.subplots(rows, cols, figsize=(10, 8),
sharex=True, sharey=True)
for ax, curve, label in zip(flatten(axes), curves, labels):
plot_data(ax=ax, label='Training data')
plot_curve(curve, ax=ax, label=label)
ax.set_ylim(-5e10, 170e10)
ax.legend()
# add a big axes, hide frame
fig.add_subplot(111, frameon=False)
# hide tick and tick label of the big axes
plt.tick_params(labelcolor='none', top='off', bottom='off',
left='off', right='off')
plt.grid(False)
plt.title('Polynomial Regression')
plt.xlabel('Water Level Change (m)')
plt.ylabel('Water Flow (Liters)')
plt.tight_layout()
# HIDDEN
def coefs(clf):
reg = clf.named_steps['reg']
return np.append(reg.intercept_, reg.coef_)
def coef_table(clf):
vals = coefs(clf)
return (pd.DataFrame({'Coefficient Value': vals})
.rename_axis('degree'))
# HIDDEN
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
X = df.iloc[:, [0]].to_numpy()
y = df.iloc[:, 1].to_numpy()
degrees = [1, 2, 8, 12]
clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
('reg', LinearRegression())])
.fit(X, y)
for deg in degrees]
curves = [make_curve(clf) for clf in clfs]
alphas = [0.01, 0.1, 1.0, 10.0]
ridge_clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
('reg', RidgeCV(alphas=alphas, normalize=True))])
.fit(X, y)
for deg in degrees]
ridge_curves = [make_curve(clf) for clf in ridge_clfs]
In this section we introduce $L_2$ regularization, a method of penalizing large weights in our cost function to lower model variance. We briefly review linear regression, then introduce regularization as a modification to the cost function.
To perform least squares linear regression, we use the model:

$$ f_{\hat{\theta}}(x) = \hat{\theta} \cdot x $$

We fit the model by minimizing the mean squared error cost function:

$$ L(\hat{\theta}, X, y) = \frac{1}{n} \sum_{i}(y_i - f_{\hat{\theta}}(X_i))^2 $$

In the above definitions, $X$ represents the data matrix, $X_i$ represents a row of $X$, $y$ represents the observed outcomes, and $\hat{\theta}$ represents the model weights.
To add $L_2$ regularization to the model, we modify the cost function above:
$$ \begin{aligned} L(\hat{\theta}, X, y) &= \frac{1}{n} \sum_{i}(y_i - f_{\hat{\theta}}(X_i))^2
+ \lambda \sum_{j = 1}^{p} \hat{\theta}_j^2
\end{aligned} $$
Notice that the cost function above is the same as before with the addition of the $L_2$ regularization term. The summation in this term sums the square of each model weight $\hat{\theta}_j$. The term also introduces a new scalar model parameter $\lambda$ that adjusts the regularization penalty.
The regularization term causes the cost to increase if the values in $\hat{\theta}$ are further away from 0. With the addition of regularization, the optimal model weights minimize the combination of loss and regularization penalty rather than the loss alone. Since the resulting model weights tend to be smaller in absolute value, the model has lower variance and higher bias.
Using $L_2$ regularization with a linear model and the mean squared error cost function is also known more commonly as ridge regression.
The regularization parameter $\lambda$ controls the regularization penalty. A small $\lambda$ results in a small penalty; if $\lambda = 0$, the regularization term is also $0$ and the cost is not regularized at all.
A large $\lambda$ results in a large penalty and therefore a simpler model. Increasing $\lambda$ decreases the variance and increases the bias of the model. We use cross-validation to select the value of $\lambda$ that minimizes the validation error.
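The regularized cost function can be written out directly in NumPy. The following is a minimal sketch; the data matrix, outcomes, weights, and regularization value below are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def ridge_cost(theta, X, y, lambda_):
    # Mean squared error of the linear predictions X @ theta
    mse = np.mean((y - X @ theta) ** 2)
    # L2 penalty: lambda times the sum of squared weights
    penalty = lambda_ * np.sum(theta ** 2)
    return mse + penalty

# Tiny assumed dataset where theta = [1, 1] fits the outcomes exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([3.0, 3.0, 6.0])
theta = np.array([1.0, 1.0])

print(ridge_cost(theta, X, y, lambda_=0.0))  # 0.0: with no penalty this is just the MSE
print(ridge_cost(theta, X, y, lambda_=1.0))  # 2.0: the MSE plus 1.0 * (1^2 + 1^2)
```

Setting lambda_ to zero recovers the unregularized cost, matching the observation that $\lambda = 0$ leaves the cost unregularized.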
Note about regularization in scikit-learn:
scikit-learn provides regression models that have regularization built-in. For example, to conduct ridge regression you may use the sklearn.linear_model.Ridge regression model. Note that scikit-learn models call the regularization parameter alpha instead of $\lambda$.
scikit-learn also conveniently provides regularized models that perform cross-validation to select a good value of $\lambda$. For example, sklearn.linear_model.RidgeCV allows users to input regularization parameter values and will automatically use cross-validation to select the parameter value with the least validation error.
Note that the bias term is not included in the summation of the regularization term. We do not penalize the bias term because increasing the bias term does not increase the variance of our model—the bias term simply shifts all predictions by a constant value.
Notice that the $L_2$ regularization term penalizes each $\hat{\theta}_j$ equally. However, the effect of penalizing each weight differs depending on the data itself. Consider this section of the water flow dataset after adding degree 8 polynomial features:
# HIDDEN
pd.DataFrame(clfs[2].named_steps['poly'].transform(X[:5]),
columns=[f'deg_{n}_feat' for n in range(1, 9)])
We can see that the degree 8 polynomial features have much larger values than the degree 1 features. This means that a large model weight for the degree 8 features affects the predictions much more than a large model weight for the degree 1 features. If we apply regularization to this data directly, the regularization penalty will disproportionately lower the model weight for the lower degree features. In practice, this often results in high model variance even after applying regularization since the features with the largest effect on prediction will not be affected.
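A quick calculation shows how sharply the feature scales diverge as the degree grows; the water level change of 40 meters below is an assumed example value, not a row from the dataset.

```python
import numpy as np

# Polynomial features of degrees 1 through 8 for a single assumed input value
x = 40.0
features = np.array([x ** d for d in range(1, 9)])

print(features[0])   # 40.0: the degree 1 feature
print(features[-1])  # 6553600000000.0: the degree 8 feature, 40^8
```

The same change in a model weight therefore moves the degree 8 feature's contribution to the prediction about eleven orders of magnitude more than the degree 1 feature's, which is why the raw features need rescaling before regularization.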
To combat this, we normalize each data column by subtracting the mean and scaling the values in each column to be between -1 and 1. In scikit-learn, most regression models allow initializing with normalize=True to normalize the data before fitting.
Another analogous technique is standardizing the data columns by subtracting the mean and dividing by the standard deviation for each data column.
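Both rescaling schemes take only a few lines of NumPy. The sketch below shows standardization (subtract each column's mean, divide by its standard deviation) on an assumed two-column data matrix.

```python
import numpy as np

# Assumed data matrix whose columns are on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has mean 0 and standard deviation 1
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After standardizing, the regularization penalty treats every column on a comparable scale.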
We have previously used polynomial features to fit polynomials of degree 2, 8, and 12 to water flow data. The original data and resulting model predictions are repeated below.
# HIDDEN
df
# HIDDEN
plot_curves(curves)
To conduct ridge regression, we first extract the data matrix $X$ and the vector of outcomes $y$ from the data:
X = df.iloc[:, [0]].to_numpy()
y = df.iloc[:, 1].to_numpy()
print('X: ')
print(X)
print()
print('y: ')
print(y)
Then, we apply a degree 8 polynomial transform to X:
from sklearn.preprocessing import PolynomialFeatures
# We need to specify include_bias=False since sklearn's classifiers
# automatically add the intercept term.
X_poly_8 = PolynomialFeatures(degree=8, include_bias=False).fit_transform(X)
print('First two rows of transformed X:')
print(X_poly_8[0:2])
We specify the alpha values that scikit-learn will select from using cross-validation, then use the RidgeCV model to fit the transformed data.
from sklearn.linear_model import RidgeCV
alphas = [0.01, 0.1, 1.0, 10.0]
# Remember to set normalize=True to normalize data
clf = RidgeCV(alphas=alphas, normalize=True).fit(X_poly_8, y)
# Display the chosen alpha value:
clf.alpha_
Finally, we plot the model predictions for the base degree 8 polynomial model next to the regularized degree 8 model:
# HIDDEN
fig = plt.figure(figsize=(10, 5))
plt.subplot(121)
plot_data()
plot_curve(curves[2])
plt.title('Base degree 8 polynomial')
plt.subplot(122)
plot_data()
plot_curve(ridge_curves[2])
plt.title('Regularized degree 8 polynomial')
plt.tight_layout()
We can see that the regularized polynomial is smoother than the base degree 8 polynomial and still captures the major trend in the data.
Comparing the coefficients of the non-regularized and regularized models shows that ridge regression favors placing model weights on the lower degree polynomial terms:
# HIDDEN
base = coef_table(clfs[2]).rename(columns={'Coefficient Value': 'Base'})
ridge = coef_table(ridge_clfs[2]).rename(columns={'Coefficient Value': 'Regularized'})
pd.options.display.max_rows = 20
display(base.join(ridge))
pd.options.display.max_rows = 7
Repeating the process for degree 12 polynomial features results in a similar result:
# HIDDEN
fig = plt.figure(figsize=(10, 5))
plt.subplot(121)
plot_data()
plot_curve(curves[3])
plt.title('Base degree 12 polynomial')
plt.ylim(-5e10, 170e10)
plt.subplot(122)
plot_data()
plot_curve(ridge_curves[3])
plt.title('Regularized degree 12 polynomial')
plt.ylim(-5e10, 170e10)
plt.tight_layout()
Increasing the regularization parameter results in progressively simpler models. The plot below demonstrates the effects of increasing the regularization amount from 0.001 to 100.0.
# HIDDEN
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
alpha_clfs = [Pipeline([
('poly', PolynomialFeatures(degree=12, include_bias=False)),
('reg', Ridge(alpha=alpha, normalize=True))]
).fit(X, y) for alpha in alphas]
alpha_curves = [make_curve(clf) for clf in alpha_clfs]
labels = [f'$\\lambda = {alpha}$' for alpha in alphas]
plot_curves(alpha_curves, cols=3, labels=labels)
As we can see, increasing the regularization parameter increases the bias of our model. If our parameter is too large, the model becomes a constant model because any non-zero model weight is heavily penalized.
Using $L_2$ regularization allows us to tune model bias and variance by penalizing large model weights. $L_2$ regularization for least squares linear regression is also known by the more common name ridge regression. Using regularization adds an additional model parameter $\lambda$ that we adjust using cross-validation.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/16'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
df = pd.read_csv('water_large.csv')
# HIDDEN
from collections import namedtuple
Curve = namedtuple('Curve', ['xs', 'ys'])
def flatten(seq): return [item for subseq in seq for item in subseq]
def make_curve(clf, x_start=-50, x_end=50):
xs = np.linspace(x_start, x_end, num=100)
ys = clf.predict(xs.reshape(-1, 1))
return Curve(xs, ys)
def plot_data(df=df, ax=plt, **kwargs):
ax.scatter(df.iloc[:, 0], df.iloc[:, 1], s=50, **kwargs)
def plot_curve(curve, ax=plt, **kwargs):
ax.plot(curve.xs, curve.ys, **kwargs)
def plot_curves(curves, cols=2, labels=None):
if labels is None:
labels = [f'Deg {deg} poly' for deg in degrees]
rows = int(np.ceil(len(curves) / cols))
fig, axes = plt.subplots(rows, cols, figsize=(10, 8),
sharex=True, sharey=True)
for ax, curve, label in zip(flatten(axes), curves, labels):
plot_data(ax=ax, label='Training data')
plot_curve(curve, ax=ax, label=label)
ax.set_ylim(-5e10, 170e10)
ax.legend()
# add a big axes, hide frame
fig.add_subplot(111, frameon=False)
# hide tick and tick label of the big axes
plt.tick_params(labelcolor='none', top='off', bottom='off',
left='off', right='off')
plt.grid(False)
plt.title('Polynomial Regression')
plt.xlabel('Water Level Change (m)')
plt.ylabel('Water Flow (Liters)')
plt.tight_layout()
# HIDDEN
def coefs(clf):
reg = clf.named_steps['reg']
return np.append(reg.intercept_, reg.coef_)
def coef_table(clf):
vals = coefs(clf)
return (pd.DataFrame({'Coefficient Value': vals})
.rename_axis('degree'))
# HIDDEN
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
X = df.iloc[:, [0]].to_numpy()
y = df.iloc[:, 1].to_numpy()
degrees = [1, 2, 8, 12]
clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
('reg', LinearRegression())])
.fit(X, y)
for deg in degrees]
curves = [make_curve(clf) for clf in clfs]
alphas = [0.1, 1.0, 10.0]
ridge_clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
('reg', RidgeCV(alphas=alphas, normalize=True))])
.fit(X, y)
for deg in degrees]
ridge_curves = [make_curve(clf) for clf in ridge_clfs]
lasso_clfs = [Pipeline([('poly', PolynomialFeatures(degree=deg, include_bias=False)),
('reg', LassoCV(normalize=True, precompute=True, tol=0.001))])
.fit(X, y)
for deg in degrees]
lasso_curves = [make_curve(clf) for clf in lasso_clfs]
In this section we introduce $L_1$ regularization, another regularization technique that is useful for feature selection.
We start with a brief review of $L_2$ regularization for linear regression. We use the model:

$$ f_{\hat{\theta}}(x) = \hat{\theta} \cdot x $$

We fit the model by minimizing the mean squared error cost function with an additional $L_2$ regularization term:
$$ \begin{aligned} L(\hat{\theta}, X, y) &= \frac{1}{n} \sum_{i}(y_i - f_{\hat{\theta}}(X_i))^2
+ \lambda \sum_{j = 1}^{p} \hat{\theta}_j^2
\end{aligned} $$
In the above definitions, $X$ represents the data matrix, $X_i$ represents a row of $X$, $y$ represents the observed outcomes, $\hat{\theta}$ represents the model weights, and $\lambda$ represents the regularization parameter.
To add $L_1$ regularization to the model, we modify the cost function above:
$$ \begin{aligned} L(\hat{\theta}, X, y) &= \frac{1}{n} \sum_{i}(y_i - f_{\hat{\theta}}(X_i))^2
+ \lambda \sum_{j = 1}^{p} |\hat{\theta}_j|
\end{aligned} $$
Observe that the two cost functions only differ in their regularization term. The $L_1$ regularization term penalizes the sum of the absolute weight values instead of the sum of squared values.
Using $L_1$ regularization with a linear model and the mean squared error cost function is also known more commonly as lasso regression. (Lasso stands for Least Absolute Shrinkage and Selection Operator.)
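The two penalties differ by a single line of code. The sketch below compares them on assumed data and weights; note how the $L_1$ penalty grows linearly in each weight while the $L_2$ penalty grows quadratically.

```python
import numpy as np

def lasso_cost(theta, X, y, lambda_):
    mse = np.mean((y - X @ theta) ** 2)
    return mse + lambda_ * np.sum(np.abs(theta))  # L1: sum of absolute weights

def ridge_cost(theta, X, y, lambda_):
    mse = np.mean((y - X @ theta) ** 2)
    return mse + lambda_ * np.sum(theta ** 2)     # L2: sum of squared weights

# Assumed data and weights for illustration
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
theta = np.array([0.5, 2.0])

print(lasso_cost(theta, X, y, 1.0))  # 3.125: MSE 0.625 + L1 penalty 2.5
print(ridge_cost(theta, X, y, 1.0))  # 4.875: MSE 0.625 + L2 penalty 4.25
```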
To conduct lasso regression, we make use of scikit-learn's convenient LassoCV model, a version of the Lasso estimator that performs cross-validation to select the regularization parameter. Below, we display our dataset of water level change and water flow out of a dam.
# HIDDEN
df
Since the procedure is almost identical to using the RidgeCV model from the previous section, we omit the code and instead display the base degree 12 polynomial, ridge regression, and lasso regression model predictions below.
# HIDDEN
fig = plt.figure(figsize=(10, 4))
plt.subplot(131)
plot_data()
plot_curve(curves[3])
plt.title('Base')
plt.ylim(-5e10, 170e10)
plt.subplot(132)
plot_data()
plot_curve(ridge_curves[3])
plt.title('Ridge Regression')
plt.ylim(-5e10, 170e10)
plt.subplot(133)
plot_data()
plot_curve(lasso_curves[3])
plt.title('Lasso Regression')
plt.ylim(-5e10, 170e10)
plt.tight_layout()
We can see that both regularized models have less variance than the base degree 12 polynomial. At a glance, it appears that using $L_2$ and $L_1$ regularization produces nearly identical models. Comparing the coefficients of ridge and lasso regression, however, reveals the most significant difference between the two types of regularization: the lasso regression model sets a number of model weights to zero.
# HIDDEN
ridge = coef_table(ridge_clfs[3]).rename(columns={'Coefficient Value': 'Ridge'})
lasso = coef_table(lasso_clfs[3]).rename(columns={'Coefficient Value': 'Lasso'})
pd.options.display.max_rows = 20
pd.set_option('display.float_format', '{:.10f}'.format)
display(ridge.join(lasso))
pd.options.display.max_rows = 7
pd.set_option('display.float_format', '{:.2f}'.format)
If you will forgive the verbose output above, you will notice that ridge regression results in non-zero weights for all of the polynomial features. Lasso regression, on the other hand, produces weights of zero for seven features.
In other words, the lasso regression model completely tosses out a majority of the features when making predictions. Nonetheless, the plots above show that the lasso regression model will make nearly identical predictions compared to the ridge regression model.
Lasso regression performs feature selection—it discards a subset of the original features when fitting model parameters. This is particularly useful when working with high-dimensional data with many features. A model that only uses a few features to make a prediction will run much faster than a model that requires many calculations. Since unneeded features tend to increase model variance without decreasing bias, we can sometimes increase the accuracy of other models by using lasso regression to select a subset of features to use.
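We can see this selection behavior on synthetic data. In the sketch below, only the first two of five features actually influence the outcome; the data, noise level, and alpha value are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends only on features 0 and 1
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Lasso zeroes out the weights of the three irrelevant features
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.nonzero(lasso.coef_)[0]
print(selected)
```

The non-zero indices identify the relevant features; scikit-learn's sklearn.feature_selection.SelectFromModel wraps this same pattern for use in pipelines.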
If our goal is merely to achieve the highest prediction accuracy, we can try both types of regularization and use cross-validation to select between the two types.
Sometimes we prefer one type of regularization over the other because it maps more closely to the domain we are working with. For example, if we know that the phenomenon we are trying to model results from many small factors, we might prefer ridge regression because it won't discard these factors. On the other hand, some outcomes result from a few highly influential features. We prefer lasso regression in these situations because it will discard unneeded features.
Using $L_1$ regularization, like $L_2$ regularization, allows us to tune model bias and variance by penalizing large model weights. $L_1$ regularization for least squares linear regression is also known by the more common name lasso regression. Lasso regression may also be used to perform feature selection since it discards insignificant features.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/17'))
Thus far we have studied models for regression, the process of making continuous, numerical estimations based on data. We now turn our attention to classification, the process of making categorical predictions based on data. For example, weather stations are interested in predicting whether tomorrow will be rainy or not using the weather conditions today.
Together, regression and classification compose the primary approaches for supervised learning, the general task of learning a model based on observed input-output pairs.
We may reframe classification as a type of regression problem. Instead of creating a model to predict an arbitrary number, we create a model to predict a probability that a data point belongs to a category. This allows us to reuse the machinery of linear regression for a regression on probabilities: logistic regression.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/17'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
def jitter_df(df, x_col, y_col):
x_jittered = df[x_col] + np.random.normal(scale=0, size=len(df))
y_jittered = df[y_col] + np.random.normal(scale=0.05, size=len(df))
return df.assign(**{x_col: x_jittered, y_col: y_jittered})
In basketball, players score by shooting a ball through a hoop. One such player, LeBron James, is widely considered one of the best basketball players ever for his incredible ability to score.

LeBron plays in the National Basketball Association (NBA), the United States's premier basketball league. We've collected a dataset of all of LeBron's shot attempts in the 2017 NBA Playoff Games using the NBA statistics website (https://stats.nba.com/).
lebron = pd.read_csv('lebron.csv')
lebron
Each row of this dataset contains the following attributes of a shot LeBron attempted:
game_date: The date of the game.
minute: The minute that the shot was attempted (each NBA game is 48 minutes long).
opponent: The team abbreviation of LeBron's opponent.
action_type: The type of action leading up to the shot.
shot_type: The type of shot (either a 2 point shot or 3 point shot).
shot_distance: LeBron's distance from the basket when the shot was attempted (ft).
shot_made: 0 if the shot missed, 1 if the shot went in.
We would like to use this dataset to predict whether LeBron will make future shots. This is a classification problem; we predict a category, not a continuous number as we do in regression.
We may reframe this classification problem as a type of regression problem by predicting the probability that a shot will go in. For example, we expect that the probability that LeBron makes a shot is lower when he is farther away from the basket.
We plot the shot attempts below, showing the distance from the basket on the x-axis and whether he made the shot on the y-axis. Jittering the points slightly on the y-axis mitigates overplotting.
# HIDDEN
np.random.seed(42)
sns.lmplot(x='shot_distance', y='shot_made',
data=jitter_df(lebron, 'shot_distance', 'shot_made'),
fit_reg=False,
scatter_kws={'alpha': 0.3})
plt.title('LeBron Shot Make vs. Shot Distance');
We can see that LeBron tends to make most shots when he is within five feet of the basket. A simple least squares linear regression model fit on this data produces the following predictions:
# HIDDEN
np.random.seed(42)
sns.lmplot(x='shot_distance', y='shot_made',
data=jitter_df(lebron, 'shot_distance', 'shot_made'),
ci=None,
scatter_kws={'alpha': 0.4})
plt.title('Simple Linear Regression');
Linear regression predicts a continuous value. To perform classification, however, we need to convert this value into a category: a made or missed shot. We can accomplish this by setting a cutoff, or classification threshold. If the regression predicts a value greater than 0.5, we predict that the shot will go in. Otherwise, we predict that the shot will miss.
We draw the cutoff below as a green dashed line. According to this cutoff, our model predicts that LeBron will make a shot if he is within 15 feet of the basket.
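The cutoff step itself is a one-liner. In the sketch below, the regression outputs are assumed values rather than predictions from the fitted model.

```python
import numpy as np

# Assumed regression outputs for five shot attempts
predictions = np.array([0.10, 0.45, 0.50, 0.72, 0.95])

# Predict a make (1) when the output exceeds 0.5, a miss (0) otherwise
classes = (predictions > 0.5).astype(int)
print(classes)  # [0 0 0 1 1]
```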
# HIDDEN
np.random.seed(42)
sns.lmplot(x='shot_distance', y='shot_made',
data=jitter_df(lebron, 'shot_distance', 'shot_made'),
ci=None,
scatter_kws={'alpha': 0.4})
plt.axhline(y=0.5, linestyle='--', c='g')
plt.title('Cutoff for Classification');
In the steps above, we attempt to perform a regression to predict the probability that a shot will go in. If our regression produces a probability, setting a cutoff of 0.5 means that we predict that a shot will go in when our model thinks the shot going in is more likely than the shot missing. We will revisit the topic of classification thresholds later in the chapter.
Unfortunately, our linear model's predictions cannot be interpreted as probabilities. Valid probabilities must lie between zero and one, but our linear model violates this condition. For example, the probability that LeBron makes a shot when he is 100 feet away from the basket should be close to zero. In this case, however, our model will predict a negative value.
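To make this concrete, the sketch below evaluates a linear model with assumed (not fitted) intercept and slope values at several shot distances.

```python
import numpy as np

# Assumed linear model: predicted value falls by 0.015 per foot from a 0.7 baseline
intercept, slope = 0.7, -0.015
distances = np.array([0.0, 15.0, 100.0])

preds = intercept + slope * distances
print(preds)  # the prediction at 100 ft is negative, so it cannot be a probability
```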
If we alter our regression model so that its predictions may be interpreted as probabilities, we will have no qualms about using its predictions for classification. We accomplish this with a new prediction function and a new loss function. The resulting model is called a logistic model.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/17'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
def jitter_df(df, x_col, y_col):
x_jittered = df[x_col] + np.random.normal(scale=0, size=len(df))
y_jittered = df[y_col] + np.random.normal(scale=0.05, size=len(df))
return df.assign(**{x_col: x_jittered, y_col: y_jittered})
# HIDDEN
lebron = pd.read_csv('lebron.csv')
In this section, we introduce the logistic model, a regression model that we use to predict probabilities.
Recall that fitting a model requires three components: a model that makes predictions, a loss function, and an optimization method. For the by-now familiar least squares linear regression, we select the model:

$$ f_{\hat{\theta}}(x) = \hat{\theta} \cdot x $$

And the loss function:

$$ L(\theta, X, y) = \frac{1}{n} \sum_{i}(y_i - f_{\theta}(X_i))^2 $$

We use gradient descent as our optimization method. In the definitions above, $X$ represents the $n \times p$ data matrix ($n$ is the number of data points and $p$ is the number of attributes), $X_i$ represents a row of $X$, and $y$ is the vector of observed outcomes. The vector $\hat{\theta}$ contains the optimal model weights whereas $\theta$ contains intermediate weight values generated during optimization.
Observe that the model $f_{\hat{\theta}}(x) = \hat{\theta} \cdot x$ can output any real number since it produces a linear combination of the values in $x$, which itself can contain any real value.
We can easily visualize this when $x$ is a scalar. If $\hat{\theta} = 0.5$, our model becomes $f_{\hat{\theta}}(x) = 0.5 x$. Its predictions can take on any value from negative infinity to positive infinity:
# HIDDEN
xs = np.linspace(-100, 100, 100)
ys = 0.5 * xs
plt.plot(xs, ys)
plt.xlabel('$x$')
plt.ylabel(r'$f_\hat{\theta}(x)$')
plt.title(r'Model Predictions for $ \hat{\theta} = 0.5 $');
For classification tasks, we want to constrain $f_{\hat{\theta}}(x)$ so that its output can be interpreted as a probability. This means that it may only output values in the range $[0, 1]$. In addition, we would like large values of $\hat{\theta} \cdot x$ to correspond to high probabilities and small values to low probabilities.
To accomplish this, we introduce the logistic function, often called the sigmoid function:

$$ f(t) = \frac{1}{1 + e^{-t}} $$

For ease of reading, we often replace $f(t)$ with $\sigma(t)$ and write:

$$ \sigma(t) = \frac{1}{1 + e^{-t}} $$

We plot the sigmoid function for values of $t \in [-10, 10]$ below.
# HIDDEN
from scipy.special import expit
xs = np.linspace(-10, 10, 100)
ys = expit(xs)
plt.plot(xs, ys)
plt.title(r'Sigmoid Function')
plt.xlabel('$ t $')
plt.ylabel(r'$ \sigma(t) $');
Observe that the sigmoid function takes in any real number and outputs only numbers between 0 and 1. The function is monotonically increasing in its input $t$; large values of $t$ correspond to outputs closer to 1, as desired. This is not a coincidence: the sigmoid function may be derived from a log ratio of probabilities, although we omit the derivation for brevity.
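The sigmoid is simple to implement and check numerically. The sketch below verifies the properties described above; scipy.special.expit, used in the plotting code, computes the same function.

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^{-t})
    return 1 / (1 + np.exp(-t))

print(sigmoid(0))    # 0.5: the midpoint of the curve
print(sigmoid(10))   # close to 1 for large positive t
print(sigmoid(-10))  # close to 0 for large negative t

# Outputs lie strictly between 0 and 1, and sigma(-t) = 1 - sigma(t)
ts = np.linspace(-5, 5, 11)
assert np.all((sigmoid(ts) > 0) & (sigmoid(ts) < 1))
assert np.allclose(sigmoid(-ts), 1 - sigmoid(ts))
```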
We may now take our linear model and use it as the input to the sigmoid function to create the logistic model:
In other words, we take the output of linear regression—any number in — and use the sigmoid function to restrict the model's final output to be a valid probability between zero and one.
To develop some intuition for how the logistic model behaves, we restrict $x$ to be a scalar and plot the logistic model's output for several values of $\hat{\theta}$.
# HIDDEN
def flatten(li): return [item for sub in li for item in sub]
thetas = [-2, -1, -0.5, 2, 1, 0.5]
xs = np.linspace(-10, 10, 100)
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(10, 6))
for ax, theta in zip(flatten(axes), thetas):
    ys = expit(theta * xs)
    ax.plot(xs, ys)
    ax.set_title(r'$ \hat{\theta} = $' + str(theta))
# add a big axes, hide frame
fig.add_subplot(111, frameon=False)
# hide ticks and tick labels of the big axes
plt.tick_params(labelcolor='none', top=False, bottom=False,
                left=False, right=False)
plt.grid(False)
plt.xlabel('$x$')
plt.ylabel(r'$ f_{\hat{\theta}}(x) $')
plt.tight_layout()
We see that changing the magnitude of $\hat{\theta}$ changes the sharpness of the curve; the further $\hat{\theta}$ is from $0$, the sharper the curve. Flipping the sign of $\hat{\theta}$ while keeping its magnitude constant reflects the curve over the y-axis.
We introduce the logistic model, a new prediction function that outputs probabilities. To construct the model, we use the output of linear regression as the input to the nonlinear logistic function.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/17'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + nrows, col:col + ncols]
    if len(df.columns) <= ncols:
        interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
    else:
        interact(peek,
                 row=(0, len(df) - nrows, nrows),
                 col=(0, len(df.columns) - ncols))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
lebron = pd.read_csv('lebron.csv')
We have defined a regression model for probabilities, the logistic model:

$$ f_{\hat{\theta}}(\textbf{x}) = \sigma(\hat{\theta} \cdot \textbf{x}) $$

Like the model for linear regression, this model has parameters $\hat{\theta}$, a vector that contains one parameter for each feature of $\textbf{x}$. We now address the problem of defining a loss function for this model that allows us to fit the model's parameters to data.
Intuitively, we want the model's predictions to match the data as closely as possible. Below we recreate a plot of LeBron's shot attempts in the 2017 NBA Playoffs using the distance of each shot from the basket. The points are jittered on the y-axis to mitigate overplotting.
# HIDDEN
np.random.seed(42)
sns.lmplot(x='shot_distance', y='shot_made',
data=lebron,
fit_reg=False, ci=False,
y_jitter=0.1,
scatter_kws={'alpha': 0.3})
plt.title('LeBron Shot Attempts')
plt.xlabel('Distance from Basket (ft)')
plt.ylabel('Shot Made');
Noticing the large cluster of made shots close to the basket and the smaller cluster of missed shots further from the basket, we expect that a logistic model fitted on this data might look like:
# HIDDEN
from scipy.special import expit
np.random.seed(42)
sns.lmplot(x='shot_distance', y='shot_made',
data=lebron,
fit_reg=False, ci=False,
y_jitter=0.1,
scatter_kws={'alpha': 0.3})
xs = np.linspace(-2, 32, 100)
ys = expit(-0.15 * (xs - 15))
plt.plot(xs, ys, c='r', label='Logistic model')
plt.title('Possible logistic model fit')
plt.xlabel('Distance from Basket (ft)')
plt.ylabel('Shot Made');
Although we can use the mean squared error loss function as we have for linear regression, it is non-convex for a logistic model and thus difficult to optimize.
Instead of the mean squared error, we use the cross-entropy loss. Let $\textbf{X}$ represent the $n \times p$ input data matrix, $\textbf{y}$ the vector of observed data values, and $f_{\boldsymbol{\theta}}(\textbf{x})$ the logistic model. $\boldsymbol{\theta}$ contains the current parameter values. Using this notation, the average cross-entropy loss is defined as:

$$ L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_{i} \left( - y_i \ln \left( f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) - (1 - y_i) \ln \left( 1 - f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) \right) $$

You may observe that as usual we take the mean loss over each point in our dataset. The inner expression in the above summation represents the loss at one data point $(\textbf{X}_i, y_i)$:

$$ - y_i \ln \left( f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) - (1 - y_i) \ln \left( 1 - f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) $$

Recall that each $y_i$ is either 0 or 1 in our dataset. If $y_i = 0$, the first term in the loss is zero. If $y_i = 1$, the second term in the loss is zero. Thus, for each point in our dataset, only one term of the cross-entropy loss contributes to the overall loss.
Suppose $y_i = 1$ and our predicted probability is $f_{\boldsymbol{\theta}}(\textbf{X}_i) = 1$—our model is completely correct. Since $1 - y_i = 0$, the second term vanishes and the loss for this point will be:

$$ - y_i \ln \left( f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) = - \ln 1 = 0 $$

As expected, the loss for a correct prediction is $0$. You may verify that the further the predicted probability is from the true value, the greater the loss.
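This per-point behavior is easy to check numerically. Below is a sketch of the single-point loss; the `eps` clipping constant is our own addition to guard against taking the log of zero:

```python
import numpy as np

def point_loss(y, p, eps=1e-15):
    """Cross-entropy loss for one point with true label y (0 or 1)
    and predicted probability p. eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# A confident correct prediction incurs nearly zero loss ...
low = point_loss(1, 0.99)
# ... while the loss grows sharply as the prediction drifts from the truth.
high = point_loss(1, 0.01)
```

Here `low` is about $-\ln 0.99 \approx 0.01$ while `high` is about $-\ln 0.01 \approx 4.6$.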
Minimizing the overall cross-entropy loss requires the model to make the most accurate predictions it can. Conveniently, this loss function is convex, making gradient descent a useful choice for optimization.
In order to run gradient descent on a model's cross-entropy loss we must calculate the gradient of the loss function. First, we compute the derivative of the sigmoid function since we'll use it in our gradient calculation:

$$ \sigma'(t) = \sigma(t) (1 - \sigma(t)) $$

The derivative of the sigmoid function can be conveniently expressed in terms of the sigmoid function itself.
As a shorthand, we define $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$. We will soon need the gradient of $\sigma_i$ with respect to the vector $\boldsymbol{\theta}$, so we derive it now using a straightforward application of the chain rule:

$$ \nabla_{\boldsymbol{\theta}} \sigma_i = \sigma_i (1 - \sigma_i) \textbf{X}_i $$
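We can sanity-check the identity $\sigma'(t) = \sigma(t)(1 - \sigma(t))$ against a centered finite difference. This is a quick numerical check of our own, not part of the derivation:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def sigmoid_deriv(t):
    # Closed form: sigma'(t) = sigma(t) * (1 - sigma(t))
    s = sigmoid(t)
    return s * (1 - s)

# Centered finite difference approximation of the derivative
h = 1e-6
ts = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
numeric = (sigmoid(ts + h) - sigmoid(ts - h)) / (2 * h)
# The closed form and the finite difference agree to high precision
```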
Now, we derive the gradient of the cross-entropy loss with respect to the model parameters $\boldsymbol{\theta}$. In the derivation below, we let $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$.
$$
\begin{aligned}
L(\boldsymbol{\theta}, \textbf{X}, \textbf{y})
&= \frac{1}{n} \sum_i \left( - y_i \ln \left( f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) - (1 - y_i) \ln \left( 1 - f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) \right) \\
&= \frac{1}{n} \sum_i \left( - y_i \ln \sigma_i - (1 - y_i) \ln (1 - \sigma_i) \right) \\
\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y})
&= \frac{1}{n} \sum_i \left( - \frac{y_i}{\sigma_i} \nabla_{\boldsymbol{\theta}} \sigma_i + \frac{1 - y_i}{1 - \sigma_i} \nabla_{\boldsymbol{\theta}} \sigma_i \right) \\
&= - \frac{1}{n} \sum_i \left( \frac{y_i}{\sigma_i} - \frac{1 - y_i}{1 - \sigma_i} \right) \nabla_{\boldsymbol{\theta}} \sigma_i \\
&= - \frac{1}{n} \sum_i \left( \frac{y_i}{\sigma_i} - \frac{1 - y_i}{1 - \sigma_i} \right) \sigma_i (1 - \sigma_i) \textbf{X}_i \\
&= - \frac{1}{n} \sum_i \left( y_i - \sigma_i \right) \textbf{X}_i
\end{aligned}
$$
The surprisingly simple gradient expression allows us to fit a logistic model to the cross-entropy loss using gradient descent:

$$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \cdot \left( - \frac{1}{n} \sum_i \left( y_i - \sigma_i \right) \textbf{X}_i \right) $$
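Under the shorthand $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$, the loss and its gradient translate directly into code. Below is a sketch of our own: the synthetic data, step size, and iteration count are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def cross_entropy(theta, X, y, eps=1e-15):
    """Average cross-entropy loss; eps guards against log(0)."""
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def grad_cross_entropy(theta, X, y):
    """Gradient: -(1/n) * sum_i (y_i - sigma_i) X_i."""
    sigma = sigmoid(X @ theta)
    return -(X.T @ (y - sigma)) / len(y)

# Synthetic data: the label is usually 1 when x > 0
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

# Plain batch gradient descent
theta = np.zeros(1)
alpha = 0.5
for _ in range(500):
    theta = theta - alpha * grad_cross_entropy(theta, X, y)
# The loss after descent is below the starting loss at theta = 0 (ln 2)
```

A quick finite-difference check confirms the analytic gradient matches the slope of the loss.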
Section 17.6 delves into deriving update formulas for batch, stochastic, and mini-batch gradient descent.
Since the cross-entropy loss function is convex, we minimize it using gradient descent to fit logistic models to data. We now have the necessary components of logistic regression: the model, loss function, and minimization procedure. In Section 17.5, we take a closer look at why we use average cross-entropy loss for logistic regression.
# HIDDEN
from scipy.optimize import minimize as sci_min
def minimize(cost_fn, grad_cost_fn, X, y, progress=True):
    '''
    Uses scipy.minimize to minimize cost_fn using a form of gradient descent.
    '''
    theta = np.zeros(X.shape[1])
    iters = 0
    def objective(theta):
        return cost_fn(theta, X, y)
    def gradient(theta):
        return grad_cost_fn(theta, X, y)
    def print_theta(theta):
        nonlocal iters
        if progress and iters % progress == 0:
            print(f'theta: {theta} | cost: {cost_fn(theta, X, y):.2f}')
        iters += 1
    print_theta(theta)
    return sci_min(
        objective, theta, method='BFGS', jac=gradient, callback=print_theta,
        tol=1e-7
    ).x
We have developed all the components of logistic regression. First, the logistic model used to predict probabilities:

$$ f_{\hat{\theta}}(\textbf{x}) = \sigma(\hat{\theta} \cdot \textbf{x}) $$

Then, the cross-entropy loss function:

$$ L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_{i} \left( - y_i \ln \sigma_i - (1 - y_i) \ln (1 - \sigma_i) \right) $$

Finally, the gradient of the cross-entropy loss for gradient descent:

$$ \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = - \frac{1}{n} \sum_i \left( y_i - \sigma_i \right) \textbf{X}_i $$

In the expressions above, we let $\textbf{X}$ represent the input data matrix, $\textbf{X}_i$ a row of $\textbf{X}$, and $\textbf{y}$ the vector of observed data values; $f_{\hat{\theta}}(\textbf{x})$ is the logistic model with optimal parameters $\hat{\theta}$. As a shorthand, we define $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$.
Let us now return to the problem we faced at the start of this chapter: predicting which shots LeBron James will make. We start by loading the dataset of shots taken by LeBron in the 2017 NBA Playoffs.
lebron = pd.read_csv('lebron.csv')
lebron
We've included a widget below to allow you to pan through the entire DataFrame.
df_interact(lebron)
We start by using only the shot distance to predict whether or not the shot is made. scikit-learn conveniently provides a logistic regression classifier as the sklearn.linear_model.LogisticRegression class. To use the class, we first create our data matrix X and vector of observed outcomes y.
X = lebron[['shot_distance']].to_numpy()
y = lebron['shot_made'].to_numpy()
print('X:')
print(X)
print()
print('y:')
print(y)
As is customary, we split our data into a training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=40, random_state=42
)
print(f'Training set size: {len(y_train)}')
print(f'Test set size: {len(y_test)}')
scikit-learn makes it simple to initialize the classifier and fit it on X_train and y_train:
from sklearn.linear_model import LogisticRegression
simple_clf = LogisticRegression()
simple_clf.fit(X_train, y_train)
To visualize the classifier's performance, we plot the original points and the classifier's predicted probabilities.
# HIDDEN
np.random.seed(42)
sns.lmplot(x='shot_distance', y='shot_made',
data=lebron,
fit_reg=False, ci=False,
y_jitter=0.1,
scatter_kws={'alpha': 0.3})
xs = np.linspace(-2, 32, 100)
ys = simple_clf.predict_proba(xs.reshape(-1, 1))[:, 1]
plt.plot(xs, ys)
plt.title('LeBron Training Data and Predictions')
plt.xlabel('Distance from Basket (ft)')
plt.ylabel('Shot Made');
One method to evaluate the effectiveness of our classifier is to check its prediction accuracy: what proportion of points does it predict correctly?
simple_clf.score(X_test, y_test)
Our classifier achieves a rather low accuracy of 0.60 on the test set. If our classifier simply guessed each point at random, we would expect an accuracy of 0.50. In fact, if our classifier simply predicted that every shot LeBron takes will go in, we would also get an accuracy of 0.60:
# Calculates the accuracy if we always predict 1
np.count_nonzero(y_test == 1) / len(y_test)
For this classifier, we only used one out of several possible features. As in multivariable linear regression, we will likely achieve a more accurate classifier by incorporating more features.
Incorporating more numerical features in our classifier is as simple as extracting additional columns from the lebron DataFrame into the X matrix. Incorporating categorical features, on the other hand, requires us to apply a one-hot encoding. In the code below, we augment our classifier with the minute, opponent, action_type, and shot_type features, using the DictVectorizer class from scikit-learn to apply a one-hot encoding to the categorical variables.
from sklearn.feature_extraction import DictVectorizer
columns = ['shot_distance', 'minute', 'action_type', 'shot_type', 'opponent']
rows = lebron[columns].to_dict(orient='records')
onehot = DictVectorizer(sparse=False).fit(rows)
X = onehot.transform(rows)
y = lebron['shot_made'].to_numpy()
X.shape
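To see what DictVectorizer does with mixed feature types, here is a toy example of our own (the two rows and their values are invented for illustration): numeric values pass through unchanged while string values are one-hot encoded into columns named feature=value.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical rows with one numeric and one categorical feature
rows = [
    {'shot_distance': 10, 'opponent': 'GSW'},
    {'shot_distance': 25, 'opponent': 'BOS'},
]
onehot = DictVectorizer(sparse=False).fit(rows)
X_toy = onehot.transform(rows)
# Columns (sorted by name): opponent=BOS, opponent=GSW, shot_distance
```

Each distinct string value gets its own indicator column, which is exactly the one-hot encoding applied to the action_type, shot_type, and opponent features above.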
We will again split the data into a training set and test set:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=40, random_state=42
)
print(f'Training set size: {len(y_train)}')
print(f'Test set size: {len(y_test)}')
Finally, we fit our model once more and check its accuracy:
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(f'Test set accuracy: {clf.score(X_test, y_test)}')
This classifier is around 12% more accurate than the classifier that only took the shot distance into account. In Section 17.7, we explore additional metrics used to evaluate classifier performance.
We have developed the mathematical and computational machinery needed to use logistic regression for classification. Logistic regression is widely used for its simplicity and effectiveness in prediction.
In this section, we introduce KL divergence and demonstrate how minimizing average KL divergence in binary classification is equivalent to minimizing average cross-entropy loss.
Since logistic regression outputs probabilities, a logistic model produces a certain type of probability distribution. Specifically, based on optimal parameters $\hat{\theta}$, it estimates the probability that the label $y$ is $1$ for an example input $\textbf{x}$.
For example, suppose that $x$ is a scalar recording the forecasted chance of rain for the day and $y = 1$ means that Mr. Doe takes his umbrella with him to work. A logistic model with scalar parameter $\hat{\theta}$ predicts the probability that Mr. Doe takes his umbrella given a forecasted chance of rain: $P_{\hat{\theta}}(y = 1 \mid x)$.
Collecting data on Mr. Doe's umbrella usage provides us with a method of constructing an empirical probability distribution $P(y = 1 \mid x)$. For example, if there were five days with the same forecasted chance of rain $x$ and Mr. Doe only took his umbrella to work on one of them, the empirical probability for that forecast value is $P(y = 1 \mid x) = \frac{1}{5} = 0.2$. We can compute a similar probability distribution for each value of $x$ that appears in our data. Naturally, after fitting a logistic model we would like the distribution predicted by the model to be as close as possible to the empirical distribution from the dataset. That is, for all values of $x$ that appear in our data, we want:

$$ P_{\hat{\theta}}(y = 1 \mid x) \approx P(y = 1 \mid x) $$
One commonly used metric to determine the "closeness" of two probability distributions is the Kullback–Leibler divergence, or KL divergence, which has its roots in information theory.
KL divergence quantifies the difference between the probability distribution $P_{\hat{\theta}}$ computed by our logistic model with parameters $\hat{\theta}$ and the actual distribution $P$ based on the dataset. Intuitively, it calculates how imprecisely the logistic model estimates the distribution of labels in the data.
The KL divergence for binary classification between two distributions $P$ and $P_{\hat{\theta}}$ for a single data point $(\textbf{x}, y)$ is given by:

$$ D(P \| P_{\hat{\theta}}) = P(y = 1 \mid \textbf{x}) \ln \left( \frac{P(y = 1 \mid \textbf{x})}{P_{\hat{\theta}}(y = 1 \mid \textbf{x})} \right) + P(y = 0 \mid \textbf{x}) \ln \left( \frac{P(y = 0 \mid \textbf{x})}{P_{\hat{\theta}}(y = 0 \mid \textbf{x})} \right) $$

KL divergence is not symmetric, i.e., the divergence of $P_{\hat{\theta}}$ from $P$ is not the same as the divergence of $P$ from $P_{\hat{\theta}}$:

$$ D(P \| P_{\hat{\theta}}) \neq D(P_{\hat{\theta}} \| P) $$

Since our goal is to use $P_{\hat{\theta}}$ to approximate $P$, we are concerned with $D(P \| P_{\hat{\theta}})$.
The best $\theta$ values, which we denote as $\hat{\theta}$, minimize the average KL divergence of the entire dataset of $n$ points:

$$ \hat{\theta} = \displaystyle\arg \min_{\substack{\theta}} \frac{1}{n} \sum_{i=1}^{n} D\big( P(y \mid \textbf{X}_i) \;\|\; P_{\theta}(y \mid \textbf{X}_i) \big) $$

In the above equation, the $i^{\text{th}}$ data point is represented as ($\textbf{X}_i$, $y_i$) where $\textbf{X}_i$ is the $i^{\text{th}}$ row of the data matrix $\textbf{X}$ and $y_i$ is the observed outcome.
KL divergence does not penalize mismatch for events that are rare with respect to $P$. If the model predicts a high probability for an event that is actually rare, the true probability $P(y = k \mid \textbf{x})$ that weights the corresponding log ratio is low, so that term contributes little and the divergence is also low. However, if the model predicts a low probability for an event that is actually common, then the divergence is high. We can deduce that a logistic model that accurately predicts common events has a lower divergence from $P$ than does a model that accurately predicts rare events but varies widely on common events.
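Both properties—the asymmetry and the light penalty on events rare under $P$—can be checked numerically. The helper below and the probability values in it are our own illustrative choices:

```python
import numpy as np

def kl_binary(p, q):
    """KL divergence D(P || Q) for Bernoulli distributions with
    P(y=1) = p and Q(y=1) = q."""
    return (p * np.log(p / q)
            + (1 - p) * np.log((1 - p) / (1 - q)))

# Asymmetry: D(P || Q) differs from D(Q || P)
d_pq = kl_binary(0.9, 0.5)
d_qp = kl_binary(0.5, 0.9)

# Overestimating a rare event (true p = 0.01, model says 0.1)
# is penalized far less than calling a common event rare
# (true p = 0.99, model says 0.01).
d_rare = kl_binary(0.01, 0.1)
d_common = kl_binary(0.99, 0.01)
```

Here `d_common` is over an order of magnitude larger than `d_rare`, matching the deduction above.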
The structure of the above average KL divergence equation contains some surface similarities with cross-entropy loss. We will now show with some algebraic manipulation that minimizing average KL divergence is in fact equivalent to minimizing average cross-entropy loss.
Using properties of logarithms, we can rewrite the weighted log ratio:

$$ P(y = k \mid \textbf{x}) \ln \left( \frac{P(y = k \mid \textbf{x})}{P_{\theta}(y = k \mid \textbf{x})} \right) = P(y = k \mid \textbf{x}) \ln P(y = k \mid \textbf{x}) - P(y = k \mid \textbf{x}) \ln P_{\theta}(y = k \mid \textbf{x}) $$

Note that since the first term doesn't depend on $\theta$, it doesn't affect $\displaystyle\arg \min_{\substack{\theta}}$ and can be removed from the equation. The resulting expression is the cross-entropy loss of the model $P_{\theta}$:

$$ \hat{\theta} = \displaystyle\arg \min_{\substack{\theta}} - \frac{1}{n} \sum_{i=1}^{n} \sum_{k=0}^{1} P(y = k \mid \textbf{X}_i) \ln P_{\theta}(y = k \mid \textbf{X}_i) $$

Since the label $y_i$ is a known value, the probability that $y_i = 1$, $P(y = 1 \mid \textbf{X}_i)$, is equal to $y_i$ and $P(y = 0 \mid \textbf{X}_i)$ is equal to $1 - y_i$. The model's probability distribution is given by the output $\sigma_i$ of the sigmoid function discussed in the previous two sections. After making these substitutions, we arrive at the average cross-entropy loss equation:

$$ \hat{\theta} = \displaystyle\arg \min_{\substack{\theta}} - \frac{1}{n} \sum_{i=1}^{n} \left( y_i \ln \sigma_i + (1 - y_i) \ln (1 - \sigma_i) \right) $$
The cross-entropy loss also has fundamental underpinnings in statistics. Since the logistic regression model predicts probabilities, given a particular logistic model we can ask, "What is the probability that this model produced the set of observed outcomes ?" We might naturally adjust the parameters of our model until the probability of drawing our dataset from the model is as high as possible. Although we will not prove it in this section, this procedure is equivalent to minimizing the cross-entropy loss—this is the maximum likelihood statistical justification for the cross-entropy loss.
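For the curious, that equivalence can be sketched in a few lines (our own summary of the standard argument): treating each observation as an independent Bernoulli draw with success probability $f_{\boldsymbol{\theta}}(\textbf{X}_i)$, the likelihood of the observed outcomes is

```latex
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta} \mid \textbf{X}, \textbf{y})
  &= \prod_{i=1}^{n} f_{\boldsymbol{\theta}}(\textbf{X}_i)^{y_i}
     \left( 1 - f_{\boldsymbol{\theta}}(\textbf{X}_i) \right)^{1 - y_i} \\
\ln \mathcal{L}(\boldsymbol{\theta} \mid \textbf{X}, \textbf{y})
  &= \sum_{i=1}^{n} \left( y_i \ln f_{\boldsymbol{\theta}}(\textbf{X}_i)
     + (1 - y_i) \ln \left( 1 - f_{\boldsymbol{\theta}}(\textbf{X}_i) \right) \right)
\end{aligned}
```

Maximizing the log likelihood is the same as minimizing its negative divided by $n$, which is exactly the average cross-entropy loss.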
Average KL divergence can be interpreted as the average log difference between the two distributions $P$ and $P_{\theta}$, weighted by $P$. Minimizing average KL divergence also minimizes average cross-entropy loss. We can reduce the divergence of logistic regression models by selecting parameters that accurately classify commonly occurring data.
Previously, we covered batch gradient descent, an algorithm that iteratively updates $\boldsymbol{\theta}$ to find the loss-minimizing parameters $\hat{\boldsymbol{\theta}}$. We also discussed stochastic gradient descent and mini-batch gradient descent, methods that take advantage of statistical theory and parallelized hardware to decrease the time spent training the gradient descent algorithm. In this section, we will apply these concepts to logistic regression and walk through examples using scikit-learn functions.
The general update formula for batch gradient descent is given by:

$$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \cdot \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{(t)}, \textbf{X}, \textbf{y}) $$

In logistic regression, we use the cross entropy loss as our loss function:

$$ L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_{i} \left( - y_i \ln \sigma_i - (1 - y_i) \ln (1 - \sigma_i) \right) $$

The gradient of the cross entropy loss is $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = - \frac{1}{n} \sum_{i} (y_i - \sigma_i) \textbf{X}_i$. Plugging this into the update formula allows us to find the gradient descent algorithm specific to logistic regression. Letting $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$:

$$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \alpha \cdot \frac{1}{n} \sum_{i} \left( y_i - \sigma_i \right) \textbf{X}_i $$

Stochastic gradient descent approximates the gradient of the loss function across all observations using the gradient of the loss of a single data point. The general update formula is below, where $\ell(\boldsymbol{\theta}, \textbf{X}_i, y_i)$ is the loss function for a single data point:

$$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \cdot \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{(t)}, \textbf{X}_i, y_i) $$

Returning back to our example in logistic regression, we approximate the gradient of the cross entropy loss across all data points using the gradient of the cross entropy loss of one data point. This is shown below, with $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$:

$$ \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) \approx \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i) = - \left( y_i - \sigma_i \right) \textbf{X}_i $$

When we plug this approximation into the general formula for stochastic gradient descent, we find the stochastic gradient descent update formula for logistic regression:

$$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \alpha \cdot \left( y_i - \sigma_i \right) \textbf{X}_i $$

Similarly, we can approximate the gradient of the cross entropy loss for all observations using a random sample $\mathcal{B}$ of data points, known as a mini-batch:

$$ \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) \approx \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i) $$

We substitute this approximation for the gradient of the cross entropy loss, yielding a mini-batch gradient descent update formula specific to logistic regression:

$$ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \alpha \cdot \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left( y_i - \sigma_i \right) \textbf{X}_i $$
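The three update rules differ only in which rows contribute to the gradient. Below is a sketch of our own (the toy data, step size, and iteration count are illustrative choices), using the shorthand $\sigma_i = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def batch_update(theta, X, y, alpha):
    """theta + alpha * (1/n) * sum_i (y_i - sigma_i) X_i"""
    sigma = sigmoid(X @ theta)
    return theta + alpha * (X.T @ (y - sigma)) / len(y)

def sgd_update(theta, X, y, alpha, i):
    """Update using the gradient at the single point (X_i, y_i)."""
    sigma_i = sigmoid(X[i] @ theta)
    return theta + alpha * (y[i] - sigma_i) * X[i]

def minibatch_update(theta, X, y, alpha, idx):
    """Update averaged over the sampled mini-batch rows idx."""
    sigma = sigmoid(X[idx] @ theta)
    return theta + alpha * (X[idx].T @ (y[idx] - sigma)) / len(idx)

# Toy separable data: label 1 exactly when x1 - x2 > 0
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -1.0]) > 0).astype(float)

theta = np.zeros(2)
for _ in range(200):
    theta = batch_update(theta, X, y, alpha=0.5)
# theta now points roughly along (1, -1), the true separating direction
```

Swapping `batch_update` for `sgd_update` or `minibatch_update` inside the loop yields the other two algorithms.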
Scikit-learn's SGDClassifier class provides an implementation for stochastic gradient descent, which we can use by specifying loss='log'. Since scikit-learn does not have a model that implements batch gradient descent, we will compare SGDClassifier's performance against LogisticRegression on the emails dataset. We omit feature extraction for brevity:
# HIDDEN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
emails = pd.read_csv('emails_sgd.csv').sample(frac=0.5)
X, y = emails['email'], emails['spam']
X_tr = CountVectorizer().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y, random_state=42)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
log_reg = LogisticRegression(tol=0.0001, random_state=42)
stochastic_gd = SGDClassifier(tol=0.0001, loss='log', random_state=42)
%%time
log_reg.fit(X_train, y_train)
log_reg_pred = log_reg.predict(X_test)
print('Logistic Regression')
print(' Accuracy: ', accuracy_score(y_test, log_reg_pred))
print(' Precision: ', precision_score(y_test, log_reg_pred))
print(' Recall: ', recall_score(y_test, log_reg_pred))
print()
%%time
stochastic_gd.fit(X_train, y_train)
stochastic_gd_pred = stochastic_gd.predict(X_test)
print('Stochastic GD')
print(' Accuracy: ', accuracy_score(y_test, stochastic_gd_pred))
print(' Precision: ', precision_score(y_test, stochastic_gd_pred))
print(' Recall: ', recall_score(y_test, stochastic_gd_pred))
print()
The results above indicate that SGDClassifier is able to find a solution in significantly less time than LogisticRegression. Although the evaluation metrics are slightly worse on the SGDClassifier, we can improve the SGDClassifier's performance by tuning hyperparameters. Furthermore, this discrepancy is a tradeoff that data scientists often encounter in the real world. Depending on the situation, data scientists might place greater value on the lower runtime or on the higher metrics.
Stochastic gradient descent is a method that data scientists use to cut down on computational cost and runtime. We can see the value of stochastic gradient descent in logistic regression, since we would only have to calculate the gradient of the cross entropy loss for one observation at each iteration instead of for every observation in batch gradient descent. From the example using scikit-learn's SGDClassifier, we observe that stochastic gradient descent may achieve slightly worse evaluation metrics, but drastically improves runtime. On larger datasets or for more complex models, the difference in runtime might be much larger and thus more valuable.
# HIDDEN
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
emails=pd.read_csv('selected_emails.csv', index_col=0)
# HIDDEN
def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    indicator_array = np.array([texts.str.contains(word) * 1 for word in words]).T
    return indicator_array
Although we used the classification accuracy to evaluate our logistic model in previous sections, using the accuracy alone has some serious flaws that we explore in this section. To address these issues, we introduce a more useful metric to evaluate classifier performance: the Area Under Curve (AUC) metric.
Suppose we have a dataset of 1000 emails that are labeled as spam or ham (not spam) and our goal is to build a classifier that distinguishes future spam emails from ham emails. The data is contained in the emails DataFrame displayed below:
emails
Each row contains the body of an email in the body column and a spam indicator in the spam column, which is 0 if the email is ham or 1 if it is spam.
Let's compare the performance of three different classifiers:
- ham_only: labels every email as ham.
- spam_only: labels every email as spam.
- words_list_model: predicts 'ham' or 'spam' based on the presence of certain words in the body of an email.

Suppose we have a list of words words_list that we believe are common in spam emails: "please", "click", "money", "business", and "remove". We construct words_list_model using the following procedure: transform each email into a feature vector by setting the vector's $j^{\text{th}}$ entry to 1 if the $j^{\text{th}}$ word in words_list is contained in the email body and 0 if it isn't. For example, using our five chosen words and the email body "please remove by tomorrow", the feature vector would be $[1, 0, 0, 0, 1]$. This procedure generates the $1000 \times 5$ feature matrix $\textbf{X}$.
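The feature-vector construction can be sketched directly (a standalone version of the indicator transformation; the example email body is the one used above):

```python
import numpy as np
import pandas as pd

def words_in_texts(words, texts):
    """0/1 indicator matrix: entry (i, j) is 1 if words[j]
    appears anywhere in texts[i]."""
    return np.array([texts.str.contains(word) * 1 for word in words]).T

words_list = ['please', 'click', 'money', 'business', 'remove']
texts = pd.Series(['please remove by tomorrow'])
# 'please' and 'remove' are present; the other three words are not
vec = words_in_texts(words_list, texts)
```

Applied to all 1000 email bodies, this produces the feature matrix described above, one row per email and one column per word.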
The following code block displays the accuracies of the classifiers. Model creation and training are omitted for brevity.
# HIDDEN
words_list = ['please', 'click', 'money', 'business', 'remove']
X = pd.DataFrame(words_in_texts(words_list, emails['body'].str.lower())).to_numpy()
y = emails['spam'].to_numpy()
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=41, test_size=0.2
)
#Fit the model
words_list_model = LogisticRegression(fit_intercept=True)
words_list_model.fit(X_train, y_train)
y_prediction_words_list = words_list_model.predict(X_test)
y_prediction_ham_only = np.zeros(len(y_test))
y_prediction_spam_only = np.ones(len(y_test))
from sklearn.metrics import accuracy_score
print(f'ham_only test set accuracy: {np.round(accuracy_score(y_prediction_ham_only, y_test), 3)}')
print(f'spam_only test set accuracy: {np.round(accuracy_score(y_prediction_spam_only, y_test), 3)}')
print(f'words_list_model test set accuracy: {np.round(accuracy_score(y_prediction_words_list, y_test), 3)}')
Using words_list_model classifies 96% of the test set emails correctly. Although this accuracy appears high, ham_only achieves the same accuracy by simply labeling everything as ham. This is cause for concern because the data suggests we can do just as well without a spam filter at all.
As the accuracies above show, model accuracy alone can be a misleading indicator of model performance. We can understand the model's predictions in greater depth using a confusion matrix. A confusion matrix for a binary classifier is a two-by-two heatmap that contains the model predictions on one axis and the actual labels on the other.
Each entry in a confusion matrix represents a possible outcome of the classifier. If a spam email is input to the classifier, there are two possible outcomes:

- True positive: the classifier correctly labels it as spam.
- False negative: the classifier mislabels it as ham.

Similarly, if a ham email is input to the classifier, there are two possible outcomes:

- True negative: the classifier correctly labels it as ham.
- False positive: the classifier mislabels it as spam.
The costs of false positives and false negatives depend on the situation. For email classification, false positives result in important emails being filtered out, so they are worse than false negatives, in which a spam email winds up in the inbox. In medical settings, however, false negatives in a diagnostic test can be much more consequential than false positives.
We will use scikit-learn's confusion matrix function to construct confusion matrices for the three models on the training data set. The ham_only confusion matrix is shown below:
# HIDDEN
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.grid(False)
ham_only_y_pred = np.zeros(len(y_train))
spam_only_y_pred = np.ones(len(y_train))
words_list_model_y_pred = words_list_model.predict(X_train)
from sklearn.metrics import confusion_matrix
class_names = ['Spam', 'Ham']
ham_only_cnf_matrix = confusion_matrix(y_train, ham_only_y_pred, labels=[1, 0])
plot_confusion_matrix(ham_only_cnf_matrix, classes=class_names,
                      title='ham_only Confusion Matrix')
Summing the quantities in a row indicates how many emails in the training dataset belong to the corresponding class:
Summing the quantities in a column indicates how many emails the classifier predicted in the corresponding class:
- ham_only predicted there are 0 spam emails in the training dataset.
- ham_only predicted there are 800 ham emails in the training dataset.

We can see that ham_only had a high accuracy of $\frac{758}{800} \approx 0.95$ because there are 758 ham emails in the training dataset out of 800 total emails.
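That accuracy can be read straight off the confusion matrix, since correct predictions lie on the diagonal. A minimal sketch using the ham_only counts above (rows are true labels [spam, ham], columns are predictions [spam, ham]):

```python
import numpy as np

# ham_only's confusion matrix: rows are true [spam, ham],
# columns are predicted [spam, ham]
cm = np.array([[0,  42],    # 42 spam emails, none predicted spam
               [0, 758]])   # 758 ham emails, all predicted ham

# accuracy = correct predictions (the diagonal) / total emails
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.9475
```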
spam_only_cnf_matrix = confusion_matrix(y_train, spam_only_y_pred, labels=[1, 0])
plot_confusion_matrix(spam_only_cnf_matrix, classes=class_names,
                      title='spam_only Confusion Matrix')
At the other extreme, spam_only predicts the training dataset has no ham emails, which the confusion matrix indicates is far from the truth with 758 false positives.
Our main interest is the confusion matrix for words_list_model:
words_list_model_cnf_matrix = confusion_matrix(y_train, words_list_model_y_pred, labels=[1, 0])
plot_confusion_matrix(words_list_model_cnf_matrix, classes=class_names,
                      title='words_list_model Confusion Matrix')
The row totals match those of the ham_only and spam_only confusion matrices as expected since the true labels in the training dataset are unaltered for all models.
Of the 42 spam emails, words_list_model correctly classifies 18 of them, which is a poor performance. Its high accuracy is buoyed by the large number of true negatives, but this is insufficient because it does not serve its purpose of reliably filtering out spam emails.
This emails dataset is an example of a class-imbalanced dataset, in which a vast majority of labels belong to one class over the other. In this case, most of our emails are ham. Another common example of class imbalance is disease detection when the frequency of the disease in a population is low. A medical test that always concludes a patient doesn't have the disease will have a high accuracy because most patients truly won't have the disease, but its inability to identify individuals with the disease renders it useless.
We now turn to sensitivity and specificity, two metrics that are better suited for evaluating class-imbalanced datasets.
Sensitivity (also called the true positive rate) measures the proportion of data belonging to the positive class that the classifier correctly labels:

$$ \text{Sensitivity} = \frac{TP}{TP + FN} $$

From our discussion of confusion matrices, you should recognize the denominator $TP + FN$ as the sum of the entries in the first row, which is equal to the actual number of data points belonging to the positive class in the dataset. Using confusion matrices allows us to easily compare the sensitivities of our models:

- ham_only: $\frac{0}{42} = 0$
- spam_only: $\frac{42}{42} = 1$
- words_list_model: $\frac{18}{42} \approx 0.43$

Since ham_only has no true positives, it has the worst possible sensitivity value of 0. On the other hand, spam_only has an abysmally low accuracy but the best possible sensitivity value of 1 because it labels all spam emails correctly. The low sensitivity of words_list_model indicates that it frequently fails to mark spam emails as such; nevertheless, it significantly outperforms ham_only.
Specificity (also called the true negative rate) measures the proportion of data belonging to the negative class that the classifier correctly labels:

$$ \text{Specificity} = \frac{TN}{TN + FP} $$

The denominator $TN + FP$ is equal to the actual number of data points belonging to the negative class in the dataset. Again the confusion matrices help us compare the specificities of our models:

- ham_only: $\frac{758}{758} = 1$
- spam_only: $\frac{0}{758} = 0$
- words_list_model: its specificity can be read off its confusion matrix above (the true negatives divided by the 758 ham emails)

As with sensitivity, the worst and best specificities are 0 and 1, respectively. Notice that ham_only has the best specificity and worst sensitivity, while spam_only has the worst specificity and best sensitivity. Since these models only predict one label, they misclassify all instances of the other label, which is reflected in their extreme sensitivity and specificity values. The disparity is much smaller for words_list_model.
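Both metrics can be computed directly from a confusion matrix laid out as in this section (first row: positive class). A minimal sketch, where the spam-row counts (18 true positives, 24 false negatives) come from the text and the ham-row counts are hypothetical:

```python
import numpy as np

# rows: true [spam, ham]; columns: predicted [spam, ham]
# the spam row matches the counts in the text; the ham row
# counts are made up for illustration
cm = np.array([[18,  24],
               [10, 748]])

sensitivity = cm[0, 0] / cm[0].sum()  # TP / (TP + FN)
specificity = cm[1, 1] / cm[1].sum()  # TN / (TN + FP)
```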
Although sensitivity and specificity seem to describe different characteristics of a classifier, we draw an important connection between these two metrics using the classification threshold.
The classification threshold is a value that determines what class a data point is assigned to; points that fall on opposite sides of the threshold are labeled with different classes. Recall that logistic regression outputs a probability that the data point belongs to the positive class. If this probability is greater than the threshold, the data point is labeled with the positive class; if it is below the threshold, the data point is labeled with the negative class. For our case, let $f_{\hat{\theta}}$ be the logistic model and $C$ the threshold. If $f_{\hat{\theta}}(x) > C$, $x$ is labeled spam; if $f_{\hat{\theta}}(x) < C$, $x$ is labeled ham. Scikit-learn breaks ties by defaulting to the negative class, so if $f_{\hat{\theta}}(x) = C$, $x$ is labeled ham.
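This labeling rule can be written down directly. A minimal sketch (the function name is ours), with ties going to the negative class as scikit-learn does:

```python
import numpy as np

def classify(probabilities, C=0.50):
    # label spam (1) only when the predicted probability strictly exceeds C;
    # a probability exactly equal to C falls to the negative class (ham)
    return np.array([1 if p > C else 0 for p in probabilities])

classify([0.2, 0.5, 0.8])         # array([0, 0, 1]) at the default C = 0.50
classify([0.2, 0.5, 0.8], C=0.3)  # array([0, 1, 1])
```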
We can assess our model's performance at a given classification threshold $C$ by creating a confusion matrix. The words_list_model confusion matrix displayed earlier in this section uses scikit-learn's default threshold $C = 0.50$.
Raising the threshold to $C = 0.70$, meaning we label an email as spam if the probability is greater than 0.70, results in the following confusion matrix:
# HIDDEN
words_list_prediction_probabilities = words_list_model.predict_proba(X_train)[:, 1]
words_list_predictions = [1 if pred >= .70 else 0 for pred in words_list_prediction_probabilities]
high_classification_threshold = confusion_matrix(y_train, words_list_predictions, labels=[1, 0])
plot_confusion_matrix(high_classification_threshold, classes=class_names,
                      title='words_list_model Confusion Matrix $C = .70$')
By raising the bar for classifying an email as spam, 13 spam emails that were correctly classified with $C = 0.50$ are now mislabeled.
Compared to the default, a higher threshold of $C = 0.70$ increases specificity but decreases sensitivity.
Lowering the threshold to $C = 0.30$, meaning we label an email as spam if the probability is greater than 0.30, results in the following confusion matrix:
# HIDDEN
words_list_predictions = [1 if pred >= .30 else 0 for pred in words_list_prediction_probabilities]
low_classification_threshold = confusion_matrix(y_train, words_list_predictions, labels=[1, 0])
plot_confusion_matrix(low_classification_threshold, classes=class_names,
                      title='words_list_model Confusion Matrix $C = .30$')
By lowering the bar for classifying an email as spam, 6 spam emails that were mislabeled with $C = 0.50$ are now correct. However, there are more false positives.
Compared to the default, a lower threshold of $C = 0.30$ increases sensitivity but decreases specificity.
We adjust a model's sensitivity and specificity by changing the classification threshold. Although we strive to maximize both sensitivity and specificity, we can see from the confusion matrices created with varying classification thresholds that there is a tradeoff. Increasing sensitivity leads to a decrease in specificity and vice versa.
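This tradeoff is easy to see by sweeping the threshold over a small set of scores. A minimal sketch with hypothetical probabilities and labels (the helper name is ours):

```python
import numpy as np

# hypothetical predicted probabilities and true labels, for illustration
probs  = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.65])
labels = np.array([0,   0,   1,    1,   1,   0])

def sens_spec(C):
    preds = probs > C                     # label spam when probability exceeds C
    sens = np.mean(preds[labels == 1])    # fraction of true spam caught
    spec = np.mean(~preds[labels == 0])   # fraction of true ham kept
    return sens, spec

sens_spec(0.3)  # a low threshold favors sensitivity over specificity
sens_spec(0.7)  # a high threshold favors specificity over sensitivity
```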
We can calculate sensitivity and specificity values for all classification thresholds between 0 and 1 and plot them. Each threshold value is associated with a (sensitivity, specificity) pair. A ROC (Receiver Operating Characteristic) curve is a slight modification of this idea; instead of plotting (sensitivity, specificity) it plots (sensitivity, 1 - specificity) pairs, where 1 - specificity is defined as the false positive rate.
A point on the ROC curve represents the sensitivity and false positive rate associated with a specific threshold value.
The ROC curve for words_list_model is calculated below using scikit-learn's ROC curve function:
from sklearn.metrics import roc_curve
words_list_model_probabilities = words_list_model.predict_proba(X_train)[:, 1]
false_positive_rate_values, sensitivity_values, thresholds = roc_curve(y_train, words_list_model_probabilities, pos_label=1)
# HIDDEN
plt.step(false_positive_rate_values, sensitivity_values, color='b', alpha=0.2,
         where='post')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('Sensitivity')
plt.title('words_list_model ROC Curve')
Notice that as we move from left to right across the curve, sensitivity increases and the specificity decreases. Generally, the best classification threshold corresponds to high sensitivity and specificity (low false positive rate), so points in or around the northwest corner of the plot are preferable.
Let's examine the four corners of the plot:
- The bottom-left corner (0, 0) corresponds to the threshold $C = 1.0$: the model labels every email ham, behaving like ham_only, since no email can have a probability greater than 1.
- The top-right corner (1, 1) corresponds to the threshold $C = 0.0$: the model labels every email spam, behaving like spam_only, since no email can have a probability lower than 0.
- The top-left corner (0, 1) corresponds to a perfect classifier: sensitivity 1 with a false positive rate of 0.
- The bottom-right corner (1, 0) corresponds to the worst possible classifier, which mislabels every email.

A classifier that randomly predicts classes has a diagonal ROC curve containing all points where sensitivity and the false positive rate are equal:
# HIDDEN
plt.step(np.arange(0, 1, 0.001), np.arange(0, 1, 0.001), color='b', alpha=0.2,
         where='post')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('Sensitivity')
plt.title('Random Classifier ROC Curve')
Intuitively, a random classifier that predicts probability $p$ for an input will result in either a true positive or a false positive with the same chance $p$, so sensitivity and the false positive rate are equal.
We want our classifier's ROC curve to sit high above the random model's diagonal line, which brings us to the AUC metric.
The Area Under Curve (AUC) is the area under the ROC curve and serves as a single-number performance summary of the classifier. The AUC for words_list_model is shaded below and calculated using scikit-learn's AUC function:
# HIDDEN
plt.fill_between(false_positive_rate_values, sensitivity_values, step='post',
                 alpha=0.2, color='b')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('Sensitivity')
plt.title('words_list_model ROC Curve')
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train, words_list_model_probabilities)
AUC is interpreted as the probability that the classifier will assign a higher probability to a randomly chosen data point truly belonging to the positive class than to a randomly chosen data point truly belonging to the negative class. A perfect AUC value of 1 corresponds to a perfect classifier (the ROC curve would contain the point (0, 1)). The fact that words_list_model has an AUC of .906 means that roughly 90.6% of the time it is more likely to classify a spam email as spam than a ham email as spam.
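This pairwise interpretation can be computed directly: comparing every positive score against every negative score (counting ties as half) yields the AUC. A minimal sketch with hypothetical scores (the helper name is ours):

```python
import numpy as np

def auc_pairwise(pos_scores, neg_scores):
    # probability a random positive outscores a random negative,
    # with ties counted as half a win
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return np.mean((pos > neg) + 0.5 * (pos == neg))

auc_pairwise([0.9, 0.8, 0.7], [0.1, 0.2])  # 1.0: perfectly separated scores
auc_pairwise([0.5, 0.5], [0.5, 0.5])       # 0.5: indistinguishable scores
```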
By inspection, the AUC of the random classifier is 0.5, though this can vary slightly due to the randomness. An effective model will have an AUC much higher than 0.5, which words_list_model achieves. If a model's AUC is less than 0.5, it performs worse than random predictions.
AUC is an essential metric for evaluating models on class-imbalanced datasets. After training a model, it is best practice to generate an ROC curve and calculate AUC to determine the next step. If the AUC is sufficiently high, use the ROC curve to identify the best classification threshold. However, if the AUC is not satisfactory, consider doing further EDA and feature selection to improve the model.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/17'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
markers = {'triangle': ['^', sns.color_palette()[0]],
           'square': ['s', sns.color_palette()[1]],
           'circle': ['o', sns.color_palette()[2]]}

def plot_binary(data, label):
    data_copy = data.copy()
    data_copy['$y$ == ' + label] = (data_copy['$y$'] == label).astype('category')
    sns.lmplot('$x_1$', '$x_2$', data=data_copy, hue='$y$ == ' + label,
               hue_order=[True, False],
               markers=[markers[label][0], 'x'],
               palette=[markers[label][1], 'gray'],
               fit_reg=False)
    plt.xlim(1.0, 4.0)
    plt.ylim(1.0, 4.0);
# HIDDEN
import matplotlib
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_test, y_pred):
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cbar=False,
                cmap=matplotlib.cm.get_cmap('gist_yarg'))
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.xticks([0.5, 1.5, 2.5], ['iris-setosa', 'iris-versicolor', 'iris-virginica'])
    plt.yticks([0.5, 1.5, 2.5], ['iris-setosa', 'iris-versicolor', 'iris-virginica'],
               rotation='horizontal')
    ax = plt.gca()
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.set_label_position('top')
Our classifiers thus far perform binary classification where each observation belongs to one of two classes; we classified emails as either ham or spam, for example. However, many data science problems involve multiclass classification, in which we would like to classify observations as one of several different classes. For example, we may be interested in classifying emails into folders such as Family, Friends, Work, and Promotions. To solve these types of problems, we use a new method called one-vs-rest (OvR) classification.
In OvR classification (also known as one-vs-all, or OvA), we decompose a multiclass classification problem into several different binary classification problems. For example, we might observe training data as shown below:
# HIDDEN
shapes = pd.DataFrame(
    [[1.3, 3.6, 'triangle'], [1.6, 3.2, 'triangle'], [1.8, 3.8, 'triangle'],
     [2.0, 1.2, 'square'], [2.2, 1.9, 'square'], [2.6, 1.4, 'square'],
     [3.2, 2.9, 'circle'], [3.5, 2.2, 'circle'], [3.9, 2.5, 'circle']],
    columns=['$x_1$', '$x_2$', '$y$']
)
# HIDDEN
sns.lmplot('$x_1$', '$x_2$', data=shapes, hue='$y$', markers=['^', 's', 'o'], fit_reg=False)
plt.xlim(1.0, 4.0)
plt.ylim(1.0, 4.0);
Our goal is to build a multiclass classifier that labels observations as triangle, square, or circle given values for $x_1$ and $x_2$. First, we want to build a binary classifier lr_triangle that predicts whether observations are triangle or not triangle:
plot_binary(shapes, 'triangle')
Similarly, we build binary classifiers lr_square and lr_circle for the remaining classes:
plot_binary(shapes, 'square')
plot_binary(shapes, 'circle')
We know that the output of the sigmoid function in logistic regression is a probability value from 0 to 1. To solve our multiclass classification task, we find the probability of the positive class in each binary classifier and select the class that outputs the highest positive class probability. For example, if we have a new observation with the following values:
| $x_1$ | $x_2$ |
|---|---|
| 3.2 | 2.5 |
Then our multiclass classifier would input these values to each of lr_triangle, lr_square, and lr_circle. We extract the positive class probability of each of the three classifiers:
# HIDDEN
from sklearn.linear_model import LogisticRegression

lr_triangle = LogisticRegression(random_state=42)
lr_triangle.fit(shapes[['$x_1$', '$x_2$']], shapes['$y$'] == 'triangle')
proba_triangle = lr_triangle.predict_proba([[3.2, 2.5]])[0][1]

lr_square = LogisticRegression(random_state=42)
lr_square.fit(shapes[['$x_1$', '$x_2$']], shapes['$y$'] == 'square')
proba_square = lr_square.predict_proba([[3.2, 2.5]])[0][1]

lr_circle = LogisticRegression(random_state=42)
lr_circle.fit(shapes[['$x_1$', '$x_2$']], shapes['$y$'] == 'circle')
proba_circle = lr_circle.predict_proba([[3.2, 2.5]])[0][1]
| lr_triangle | lr_square | lr_circle |
|---|---|---|
| 0.145748 | 0.285079 | 0.497612 |
Since the positive class probability of lr_circle is the greatest of the three, our multiclass classifier predicts that the observation is a circle.
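The selection step itself is just an argmax over the per-class positive probabilities. A minimal sketch using the three values computed above:

```python
import numpy as np

classes = ['triangle', 'square', 'circle']
# positive-class probabilities from lr_triangle, lr_square, lr_circle above
probas = np.array([0.145748, 0.285079, 0.497612])

# the one-vs-rest prediction is the class with the highest probability
prediction = classes[int(np.argmax(probas))]
print(prediction)  # circle
```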
The Iris dataset is a famous dataset that is often used in data science to explore machine learning concepts. There are three classes, each representing a type of Iris plant:

- Iris-setosa
- Iris-versicolor
- Iris-virginica

There are four features available in the dataset: the sepal length, sepal width, petal length, and petal width of each plant, measured in centimeters.
We will create a multiclass classifier that predicts the type of Iris plant based on the four features above. First, we read in the data:
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   header=None,
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
iris
from sklearn.model_selection import train_test_split

X, y = iris.drop('species', axis=1), iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)
After dividing the dataset into train and test splits, we fit a multiclass classifier to our training data. By default, scikit-learn's LogisticRegression sets multi_class='ovr', which creates binary classifiers for each unique class:
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
We predict on the test data, and use a confusion matrix to evaluate the results.
y_pred = lr.predict(X_test)
plot_confusion_matrix(y_test, y_pred)
The confusion matrix shows that our classifier misclassified two Iris-versicolor observations as Iris-virginica. By examining the sepal_length and sepal_width features, we can hypothesize why this may have occurred:
# HIDDEN
sns.lmplot(x='sepal_length', y='sepal_width', data=iris, hue='species', fit_reg=False);
The Iris-versicolor and Iris-virginica points overlap for these two features. Though the remaining features (petal_width and petal_length) contribute additional information to help distinguish between the two classes, our classifier still misclassified the two observations.
Likewise in the real world, misclassifications are common when two classes bear similar features. Confusion matrices are valuable because they help us identify the errors our classifier makes, and thus provide insight into what kinds of additional features we may need to extract to improve the classifier.
Another type of classification problem is multilabel classification, in which each observation can have multiple labels. An example would be a document classification system: a document can have positive or negative sentiment, religious or nonreligious content, and liberal or conservative leaning. Multilabel problems can also be multiclass; we may want our document classification system to distinguish between a list of genres, or identify the language that the document is written in.
We may perform multilabel classification by simply training a separate classifier on each set of labels. To label a new point, we combine each classifier's predictions.
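A minimal sketch of this idea with hypothetical toy data: one LogisticRegression per label column, trained independently, with the per-label predictions stacked back together.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# four observations with one feature; each observation carries two
# independent binary labels (the columns of Y) -- toy data for illustration
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0, 1],
              [0, 1],
              [1, 0],
              [1, 0]])

# train one binary classifier per label column
models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# to label a point, combine each classifier's predictions column-wise
predictions = np.column_stack([m.predict(X) for m in models])
```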
Classification problems are often complex in nature. Sometimes, the problem requires us to distinguish an observation between multiple classes; in other situations, we may need to assign several labels to each observation. We leverage our knowledge of binary classifiers to create multiclass and multilabel classification systems that can achieve these tasks.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/18'))
Although data scientists often work with individual samples of data, we are almost always interested in making generalizations about the population from which the data were collected. This chapter discusses methods for statistical inference, the process of drawing conclusions about an entire population using a dataset.
Statistical inference primarily leans on two methods: hypothesis tests and confidence intervals. In the recent past these methods relied heavily on normal theory, a branch of statistics that requires substantial assumptions about the population. Today, the rapid rise of powerful computing resources has enabled a new class of methods based on resampling that generalize to many types of populations.
We first review inference using permutation tests and the bootstrap method. We then introduce bootstrap methods for regression inference and skewed distributions.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/18'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
In this section, we provide a brief review of hypothesis testing using the bootstrap and permutation tests. We assume familiarity with this topic since it is covered at length in Computational and Inferential Thinking, the textbook for Data 8. For a more thorough explanation of the concepts explained here, see Chapter 11, Chapter 12, and Chapter 13 of Computational and Inferential Thinking.
When applying data science techniques to different domains, we are often faced with questions about the world. For example, does drinking coffee cause sleep deprivation? Do autonomous vehicles crash more often than non-autonomous vehicles? Does drug X help treat pneumonia? To help answer these questions, we use hypothesis tests to make informed conclusions based on observed data.
Since data collection is often an imprecise process, we are often unsure whether the patterns in our dataset are due to noise or real phenomena. Hypothesis testing helps us determine whether a pattern could have happened because of random fluctuations in our data collection.
To explore hypothesis testing, we start with an example. The table baby contains information on baby weights at birth. It records the baby's birth weight in ounces and whether or not the mother smoked during pregnancy for 1174 babies.
# HIDDEN
baby = pd.read_csv('baby.csv')
baby = baby.loc[:, ["Birth Weight", "Maternal Smoker"]]
baby
We would like to see whether maternal smoking was associated with birth weight. To set up our hypothesis test, we can represent the two views of the world using the following null and alternative hypotheses:
Null hypothesis: In the population, the distribution of birth weights of babies is the same for mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.
Alternative hypothesis: In the population, the babies of the mothers who smoke have a lower birth weight, on average, than the babies of the non-smokers.
Our ultimate goal is to make a decision between these two data generation models. One important point to notice is that we construct our hypotheses about the parameters of the data generation model rather than the outcome of the experiment. For example, we should not construct a null hypothesis such as "The birth weights of smoking mothers will be equal to the birth weights of nonsmoking mothers", since there is natural variability in the outcome of this process.
The null hypothesis emphasizes that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance. Informally, the alternative hypothesis says that the observed difference is "real."
We should take a closer look at the structure of our alternative hypothesis. In our current set up, notice that we would reject the null hypothesis if the birth weights of babies of the mothers who smoke are significantly lower than the birth weights of the babies of the mothers who do not smoke. In other words, the alternative hypothesis encompasses/supports one side of the distribution. We call this a one-sided alternative hypothesis. In general, we would only want to use this type of alternative hypothesis if we have a good reason to believe that it is impossible to see babies of the mothers who smoke have a higher birth weight, on average.
To visualize the data, we've plotted histograms of the baby weights for babies born to maternal smokers and non-smokers.
# HIDDEN
plt.figure(figsize=(9, 6))
smokers_hist = (baby.loc[baby["Maternal Smoker"], "Birth Weight"]
                .hist(normed=True, alpha=0.8, label="Maternal Smoker"))
non_smokers_hist = (baby.loc[~baby["Maternal Smoker"], "Birth Weight"]
                    .hist(normed=True, alpha=0.8, label="Not Maternal Smoker"))
smokers_hist.set_xlabel("Baby Birth Weights")
smokers_hist.set_ylabel("Proportion per Unit")
smokers_hist.set_title("Distribution of Birth Weights")
plt.legend()
plt.show()
The weights of the babies of the mothers who smoked seem lower on average than the weights of the babies of the non-smokers. Could this difference likely have occurred due to random variation? We can try to answer this question using a hypothesis test.
To perform a hypothesis test, we assume a particular model for generating the data; then, we ask ourselves, what is the chance we would see an outcome as extreme as the one that we observed? Intuitively, if the chance of seeing the outcome we observed is very small, the model that we assumed may not be the appropriate model.
In particular, we assume that the null hypothesis and its probability model, the null model, is true. In other words, we assume that the null hypothesis is true and focus on what the value of the statistic would be under the null hypothesis. This chance model says that there is no underlying difference; the distributions in the samples are different just due to chance.
In our example, we would assume that maternal smoking has no effect on baby weight (where any observed difference is due to chance). In order to choose between our hypotheses, we will use the difference between the two group means as our test statistic. Formally, our test statistic is

$$ \bar{x}_{\text{smoker}} - \bar{x}_{\text{nonsmoker}} $$

so that small values (that is, large negative values) of this statistic will favor the alternative hypothesis. Let's calculate the observed value of the test statistic:
nonsmoker = baby.loc[~baby["Maternal Smoker"], "Birth Weight"]
smoker = baby.loc[baby["Maternal Smoker"], "Birth Weight"]
observed_difference = np.mean(smoker) - np.mean(nonsmoker)
observed_difference
If there were really no difference between the two distributions in the underlying population, then whether each mother was a maternal smoker or not should not affect the average birth weight. In other words, the label True or False with respect to maternal smoking should make no difference to the average.
Therefore, in order to simulate the test statistic under the null hypothesis, we can shuffle all the birth weights randomly among the mothers. We conduct this random permutation below.
def shuffle(series):
    '''
    Shuffles a series and resets index to preserve shuffle when adding series
    back to DataFrame.
    '''
    return series.sample(frac=1, replace=False).reset_index(drop=True)
baby["Shuffled"] = shuffle(baby["Birth Weight"])
baby
Tests based on random permutations of the data are called permutation tests. In the cell below, we will simulate our test statistic many times and collect the differences in an array.
differences = np.array([])
repetitions = 5000
for i in np.arange(repetitions):
    baby["Shuffled"] = shuffle(baby["Birth Weight"])

    # Find the difference between the means of two randomly assigned groups
    nonsmoker = baby.loc[~baby["Maternal Smoker"], "Shuffled"]
    smoker = baby.loc[baby["Maternal Smoker"], "Shuffled"]
    simulated_difference = np.mean(smoker) - np.mean(nonsmoker)

    differences = np.append(differences, simulated_difference)
We plot a histogram of the simulated difference in means below:
# HIDDEN
differences_df = pd.DataFrame()
differences_df["differences"] = differences
diff_hist = differences_df.loc[:, "differences"].hist(normed=True)
diff_hist.set_xlabel("Birth Weight Difference")
diff_hist.set_ylabel("Proportion per Unit")
diff_hist.set_title("Distribution of Birth Weight Differences");
It makes intuitive sense that the distribution of differences is centered around 0 since the two groups should have the same average under the null hypothesis.
In order to draw a conclusion for this hypothesis test, we should calculate the p-value. The empirical p-value of the test is the proportion of simulated differences that were equal to or less than the observed difference.
p_value = np.count_nonzero(differences <= observed_difference) / repetitions
p_value
At the beginning of the hypothesis test, we typically choose a significance threshold for the p-value (commonly denoted $\alpha$). If our p-value is below our significance threshold, then we reject the null hypothesis. The most commonly chosen thresholds are 0.01 and 0.05, where 0.01 is considered to be more "strict" since we would need more evidence in favor of the alternative hypothesis to reject the null hypothesis.
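The decision rule is mechanical once the threshold is fixed. A minimal sketch (the function name is ours):

```python
def reject_null(p_value, alpha=0.05):
    # reject the null hypothesis when the p-value falls below
    # the chosen significance threshold
    return p_value < alpha

reject_null(0.003, alpha=0.01)  # True: significant even at the strict threshold
reject_null(0.030, alpha=0.01)  # False: not significant at 0.01
reject_null(0.030, alpha=0.05)  # True: significant at 0.05
```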
In either of these two cases, we reject the null hypothesis since the p-value is less than the significance threshold.
Data scientists must often estimate an unknown population parameter using a random sample. Although we would ideally like to take numerous samples from the population to generate a sampling distribution, we are often limited to a single sample by money and time.
Fortunately, a large, randomly collected sample looks like the original population. The bootstrap procedure uses this fact to simulate new samples by resampling from the original sample.
To conduct the bootstrap, we perform the following steps:

1. Treat the original sample as if it were the population (the bootstrap population).
2. Draw a resample with replacement from the original sample, of the same size as the original sample.
3. Compute the statistic of interest on the resample.
4. Repeat steps 2 and 3 many times to form an empirical sampling distribution of the statistic.
We may use the bootstrap sampling distribution to create a confidence interval which we use to estimate the value of the population parameter.
Since the birth weight data provides a large, random sample, we may act as if the data on mothers who did not smoke are representative of the population of nonsmoking mothers. Similarly, we act as if the data for smoking mothers are representative of the population of smoking mothers.
Therefore, we treat our original sample as the bootstrap population to perform the bootstrap procedure:

1. Draw a resample with replacement from the nonsmokers' birth weights and compute its mean; do the same for the smokers' birth weights.
2. Record the difference between the two resample means.
3. Repeat many times.

This procedure gives us an empirical sampling distribution of differences in mean baby weights.
def resample(sample):
    return np.random.choice(sample, size=len(sample))

def bootstrap(sample, stat, replicates):
    return np.array([
        stat(resample(sample)) for _ in range(replicates)
    ])
nonsmoker = baby.loc[~baby["Maternal Smoker"], "Birth Weight"]
smoker = baby.loc[baby["Maternal Smoker"], "Birth Weight"]
nonsmoker_means = bootstrap(nonsmoker, np.mean, 10000)
smoker_means = bootstrap(smoker, np.mean, 10000)
mean_differences = smoker_means - nonsmoker_means
We plot the empirical distribution of the difference in means below:
# HIDDEN
mean_differences_df = pd.DataFrame()
mean_differences_df["differences"] = np.array(mean_differences)
mean_diff = mean_differences_df.loc[:, "differences"].hist(normed=True)
mean_diff.set_xlabel("Birth Weight Difference")
mean_diff.set_ylabel("Proportion per Unit")
mean_diff.set_title("Distribution of Birth Weight Differences");
Finally, to construct a 95% confidence interval we take the 2.5th and 97.5th percentiles of the bootstrap statistics:
(np.percentile(mean_differences, 2.5),
np.percentile(mean_differences, 97.5))
This confidence interval allows us to state with 95% confidence that the population mean difference in birth weights is between -11.37 and -7.18 ounces.
In this section, we review hypothesis testing using the permutation test and confidence intervals using the bootstrap. To conduct a hypothesis test, we must state our null and alternative hypotheses, select an appropriate test statistic, and perform the testing procedure to calculate a p-value. To create a confidence interval, we select an appropriate test statistic, bootstrap the original sample to generate an empirical distribution of the test statistic, and select the quantiles corresponding to our desired confidence level.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/18'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
There are several cases where we would like to perform a permutation test in order to test a hypothesis and learn more about the world. A permutation test is a very useful type of nonparametric test that allows us to make inferences without the statistical assumptions that underlie traditional parametric tests.
One insightful example of permutation inference is the reexamination of Student Evaluation of Teaching (SET) data by Boring, Ottoboni, and Stark (2016). In this experiment, 47 students were randomly assigned to one of four sections. Two TAs each taught two sections; one TA is male and the other is female. In two of the sections, the teaching assistants were introduced using their actual names. In the other two sections, the assistants switched names.
#HIDDEN
from IPython.display import Image
display(Image('student_setup.png'))
Students never met the teaching assistants face-to-face; instead, the TAs interacted with students via an online forum. Homework returns were coordinated so that all students received scores and feedback at the same time, and the two TAs had comparable levels of experience. At the end of the course, students evaluated their TA on promptness in returning assignments. The authors wanted to investigate whether perceived gender has any effect on SET ratings.
We conduct a hypothesis test using a p-value cutoff of 0.05.
In our model, each TA has two possible ratings from each student—one for each perceived gender. Each student had an equal chance of being assigned to any one of the (gender, perceived gender) pairs. Finally, the students evaluate their TAs independently of one another.
The null hypothesis of this experiment is that perceived gender has no effect on SETs and any observed difference in ratings is due to chance. In other words, the evaluation of each TA should remain unchanged whether they are perceived as male or female. This means that each TA really only has one possible rating from each student.
The alternative hypothesis is that perceived gender has an effect on SETs.
The test statistic is the difference in means of perceived-female and perceived-male ratings for each TA. Intuitively, we expect this to be close to 0 if gender has no effect on ratings. We can write this formally:

$$T = \bar{y}_{\text{female}} - \bar{y}_{\text{male}} \qquad \text{where} \qquad \bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$$

where $n_i$ is the number of students in the $i$th group and $y_{ij}$ is the rating of the $j$th student in the $i$th group.
In order to determine whether gender has an effect on SET ratings, we perform a permutation test to generate an empirical distribution of the test statistic under the null hypothesis. We follow these steps:
Randomly permute the perceived gender labels.
Compute the test statistic on the permuted data.
Repeat many times to build an empirical distribution of the test statistic.
Compare the observed test statistic against this distribution to compute a p-value.
It is important to understand why the permutation test is justified in this scenario. Under the null model, each student would have given their TA the same rating regardless of perceived gender. Simple random assignment then implies that for a given TA, all of their ratings had an equal chance of showing up regardless of whether they were perceived as male or female. Therefore, permuting the gender labels should have no effect on the ratings if the null hypothesis were true.
We begin with the student and gender data below. These data are a census of 47 students enrolled in an online course at a U.S. university.
#HIDDEN
student_eval = (
pd.read_csv('StudentRatingsData.csv')
.loc[:, ["tagender", "taidgender", "prompt"]]
.dropna()
.rename(columns={'tagender': 'actual', 'taidgender': 'perceived'})
)
student_eval[['actual', 'perceived']] = (
student_eval[['actual', 'perceived']]
.replace([0, 1], ['female', 'male'])
)
student_eval
The columns have the following meanings:
actual – the true gender of the TA
perceived – the gender presented to the students
prompt – rating on promptness of HW on a scale from 1 to 5
The ratings data from the experiment, analyzed and plotted below, suggest a difference between the groups: perceived-female ratings are lower than perceived-male ratings. However, we need a formal hypothesis test to see whether this difference could simply be due to the random assignment of students.
# HIDDEN
avg_ratings = (student_eval
.loc[:, ['actual', 'perceived', 'prompt']]
.groupby(['actual', 'perceived'])
.mean()
.rename(columns={'prompt': 'mean prompt'})
)
avg_ratings
# HIDDEN
fig, ax = plt.subplots(figsize=(12, 7))
ind = np.arange(4)
plt.bar(ind, avg_ratings["mean prompt"])
ax.set_xticks(ind)
ax.set_xticklabels(['Female (Perceived Female)', 'Female (Perceived Male)', 'Male (Perceived Female)', 'Male (Perceived Male)'])
ax.set_ylabel('Average Promptness Rating')
ax.set_xlabel('Actual/Perceived Gender')
ax.set_title('Average Rating Per Actual/Perceived Gender')
plt.show()
We will compute the observed difference between the average ratings of the identified male and identified female groups:
def stat(evals):
'''Computes the test statistic on the evals DataFrame'''
avgs = evals.groupby('perceived').mean()
return avgs.loc['female', 'prompt'] - avgs.loc['male', 'prompt']
observed_difference = stat(student_eval)
observed_difference
We see that the difference is -0.8; the average rating for TAs perceived as female is nearly 1 point lower on a scale from 1 to 5. Given the scale of the ratings, this difference appears to be quite large. By performing a permutation test, we will be able to find the chance of observing a difference this large under our null model.
Now, we can permute the perceived gender labels for each TA and calculate the test statistic 1,000 times:
def shuffle_column(df, col):
    '''Returns a new copy of df with col shuffled (permuted without replacement)'''
    result = df.copy()
    result[col] = np.random.permutation(df[col].values)
    return result
repetitions = 1000
gender_differences = np.array([
stat(shuffle_column(student_eval, 'perceived'))
for _ in range(repetitions)
])
We plot the approximate sampling distribution of the difference in scores using our permutations below, showing the observed value using a red dotted line.
# HIDDEN
differences_df = pd.DataFrame()
differences_df["gender_differences"] = gender_differences
gender_hist = differences_df.loc[:, "gender_differences"].hist(density=True)
gender_hist.set_xlabel("Average Gender Difference (Test Statistic)")
gender_hist.set_ylabel("Percent per Unit")
gender_hist.set_title("Distribution of Gender Differences")
plt.axvline(observed_difference, c='r', linestyle='--');
From our calculation below, only 18 of the 1000 simulations had a difference at least as large as the one observed. Therefore, our p-value is less than the 0.05 threshold and we reject the null hypothesis in favor of the alternative.
# Sample distribution parameters
sample_sd = np.std(gender_differences)
sample_mean = np.mean(gender_differences)
# Compute the value equally extreme in the right tail
num_sd_away = (sample_mean - observed_difference) / sample_sd
right_extreme_val = sample_mean + (num_sd_away * sample_sd)
# Calculate the two-tailed p-value
num_extreme_left = np.count_nonzero(gender_differences <= observed_difference)
num_extreme_right = np.count_nonzero(gender_differences >= right_extreme_val)
empirical_P = (num_extreme_left + num_extreme_right) / repetitions
empirical_P
Through this permutation test, we have shown that SETs are biased against female instructors by an amount that is large and statistically significant.
Other studies have also tested for bias in teaching evaluations. As Boring, Ottoboni, and Stark (2016) note, several earlier parametric tests assumed that ratings of male and female instructors are independent random samples from normally distributed populations with equal variances; these assumptions do not align with the experimental design or the null hypothesis, so the resulting p-values are potentially misleading.
In contrast, Boring, Ottoboni, and Stark (2016) used permutation tests based on the random assignment of students to class sections. Recall that during our permutation test, we did not make any underlying assumptions about the distribution of our data. In this experiment, we did not assume that students, SET scores, grades, or any other variables comprise random samples from any populations, much less populations with normal distributions.
When testing a hypothesis, it is very important to carefully choose your experiment design and null hypothesis in order to obtain reliable results.
Recall that in linear regression, we fit a model of the following form:

$$f_{\hat{\theta}}(x) = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \ldots + \hat{\theta}_p x_p$$

We would like to infer the true coefficients of the model. Since the estimated coefficients $\hat{\theta}_0, \hat{\theta}_1, \ldots, \hat{\theta}_p$ vary based on our training data, we would like to understand how our estimated coefficients compare with the true coefficients. Bootstrapping is a nonparametric approach to statistical inference that gives us standard errors and confidence intervals for our parameters.
Let's take a look at an example of how we use bootstrapping methods within linear regression.
Otis Dudley Duncan was a quantitative sociologist interested in measuring the prestige levels of different occupations. There were only 90 occupations that were rated for their prestige level in the 1947 National Opinion Research Center (NORC) survey. Duncan wanted to “fill in” prestige scores for unrated occupations by using income and education data about each occupation recorded by the 1950 census. When joining the NORC data with the 1950 census data, only 45 occupations could be matched. Ultimately, Duncan's goal was to create a model to explain prestige using different characteristics; using this model, one can predict the prestige of other occupations not recorded in the NORC survey.
The Duncan dataset is a random sample that contains information on the prestige and other characteristics of 45 U.S. occupations in 1950. The variables are:
occupation represents the type of occupation/title.
income represents the percentage of occupational incumbents who earned incomes in excess of $3,500.
education represents the percentage of incumbents in the occupation in the 1950 U.S. Census who were high school graduates.
prestige represents the percentage of respondents in a survey who rated an occupation as “good” or “excellent” in prestige.
duncan = pd.read_csv('duncan.csv').loc[:, ["occupation", "income", "education", "prestige"]]
duncan
It is usually a good idea to explore the data through visualization in order to gain an understanding of the relationships between the variables. Below, we visualize the relationships between income, education, and prestige.
plt.scatter(x=duncan["education"], y=duncan["prestige"])
plt.scatter(x=duncan["income"], y=duncan["prestige"])
plt.scatter(x=duncan["income"], y=duncan["education"])
From the plots above, we see that both education and income are positively correlated with prestige; hence, both of these variables might be useful in helping explain prestige. Let's fit a linear model using these explanatory variables to explain prestige.
We will fit the following model, which explains the prestige of an occupation as a linear function of income and education:

$$\texttt{prestige}_i = \theta_0^* + \theta_1^* \cdot \texttt{income}_i + \theta_2^* \cdot \texttt{education}_i + \varepsilon_i$$
In order to fit this model, we will define the design matrix (X) and our response variable (y):
X = duncan.loc[:, ["income", "education"]]
X.head()
y = duncan.loc[:, "prestige"]
y.head()
Below, we fit our linear model and print all the coefficients of the model (from the equation above) after the model has been fit to the data. Note that these are our sample coefficients.
import sklearn.linear_model as lm
linear_model = lm.LinearRegression()
linear_model.fit(X, y)
print("""
intercept: %.2f
income: %.2f
education: %.2f
""" % (tuple([linear_model.intercept_]) + tuple(linear_model.coef_)))
The coefficients above give us an estimate of the true coefficients. But had our sample data been different, we would have fit our model to different data, causing these coefficients to be different. We would like to explore what our coefficients might have been using bootstrapping methods.
In our bootstrapping methods and analysis, we will focus on the coefficient of education. We would like to explore the partial relationship between prestige and education holding income constant (rather than the marginal relationship between prestige and education ignoring income). The partial regression coefficient illustrates the partial relationship between prestige and education within our data.
In this method, we consider the pairs $(X_i, y_i)$ to be our sample, so we construct the bootstrap resample by sampling with replacement from these pairs.
In other words, we sample $n$ observations with replacement from our data points; this is our bootstrap sample. Then we fit a new linear regression model to this sampled data and record the education coefficient $\tilde{\theta}_{educ}$; this coefficient is our bootstrap statistic.
def simple_resample(n):
    return np.random.randint(low=0, high=n, size=n)

def bootstrap(boot_pop, statistic, resample=simple_resample, replicates=10000):
    n = len(boot_pop)
    return np.array([
        statistic(boot_pop[resample(n)]) for _ in range(replicates)
    ])
def educ_coeff(data_array):
X = data_array[:, 1:]
y = data_array[:, 0]
linear_model = lm.LinearRegression()
model = linear_model.fit(X, y)
theta_educ = model.coef_[1]
return theta_educ
data_array = duncan.loc[:, ["prestige", "income", "education"]].values
theta_hat_sampling = bootstrap(data_array, educ_coeff)
plt.figure(figsize=(7, 5))
plt.hist(theta_hat_sampling, bins=30, density=True)
plt.xlabel("$\\tilde{\\theta}_{educ}$ Values")
plt.ylabel("Proportion per Unit")
plt.title("Bootstrap Sampling Distribution of $\\tilde{\\theta}_{educ}$ (Nonparametric)");
plt.show()
Notice how the sampling distribution above is slightly skewed to the left.
Although we cannot directly measure the true coefficient $\theta_{educ}^*$, we can use a bootstrap confidence interval to account for variability in the sample regression coefficient $\hat{\theta}_{educ}$. Below, we construct an approximate 95% confidence interval for the true coefficient using the bootstrap percentile method. The confidence interval extends from the 2.5th percentile to the 97.5th percentile of the 10,000 bootstrapped coefficients.
left_confidence_interval_endpoint = np.percentile(theta_hat_sampling, 2.5)
right_confidence_interval_endpoint = np.percentile(theta_hat_sampling, 97.5)
left_confidence_interval_endpoint, right_confidence_interval_endpoint
From the confidence interval above, we are fairly certain that the true coefficient $\theta_{educ}^*$ lies between 0.236 and 0.775.
We can also create confidence intervals based on normal theory. Since the $\tilde{\theta}_{educ}$ values appear normally distributed, we can construct a confidence interval by computing the following:

$$\hat{\theta}_{educ} \pm z \cdot SE(\tilde{\theta}_{educ})$$

where $SE(\tilde{\theta}_{educ})$ is the standard error of our bootstrapped coefficients, $z$ is a constant, and $\hat{\theta}_{educ}$ is our sample coefficient. Note that $z$ varies depending on the confidence level of the interval we are constructing. Since we are creating a 95% confidence interval, we will use $z = 1.96$.
# We use the statsmodels library to find the standard errors of the coefficients
import statsmodels.api as sm
# statsmodels does not add an intercept by default, so we add one explicitly
ols = sm.OLS(y, sm.add_constant(X))
ols_result = ols.fit()
# Several error estimates are now available, e.g. the HC0 standard errors:
ols_result.HC0_se
left_confidence_interval_endpoint_normal = 0.55 - (1.96*0.12)
right_confidence_interval_endpoint_normal = 0.55 + (1.96*0.12)
left_confidence_interval_endpoint_normal, right_confidence_interval_endpoint_normal
Observations: Notice how the confidence interval using normal theory is narrower than the confidence interval using the percentile method, especially towards the left of the interval.
We will not go into the normal theory confidence interval in great detail, but if you would like to learn more, refer to X.
Although we observed a positive partial relationship between education and prestige (from the 0.55 coefficient), what if the true coefficient is actually 0 and there is no partial relationship between education and prestige? In this case, the association that we observed was just due to variability in obtaining the points that form our sample.
To formally test whether the partial relationship between education and prestige is real, we would like to test the following hypotheses:
Null Hypothesis: The true partial coefficient is 0.
Alternative Hypothesis: The true partial coefficient is not 0.
Since we have already constructed a 95% confidence interval for the true coefficient, we just need to see whether 0 lies within this interval. Notice that 0 does not lie within our confidence interval above; therefore, we have enough evidence to reject the null hypothesis.
If the confidence interval for the true coefficient did contain 0, then we would not have enough evidence to reject the null hypothesis. In this case, the observed coefficient would likely be spurious.
In order to build the sampling distribution of the coefficient $\hat{\theta}_{educ}$ and construct the confidence interval for the true coefficient, we directly resampled the observations and fitted new regression models on our bootstrap samples. This method implicitly treats the regressors $X_i$ as random rather than fixed.
In some cases, we may want to treat the $X_i$ as fixed (if, for example, the data were derived from an experimental design). If the explanatory variables were controlled for, or their values were set by the experimenter, we may consider the following alternative bootstrapping method.
Another approach to hypothesis testing in linear regression is bootstrapping the residuals. This approach makes stronger underlying assumptions and is used less frequently in practice. In this method, we consider the residuals to be our sample, so we construct the bootstrap resample by sampling with replacement from these residuals. Once we construct each bootstrap sample, we compute new response values by adding the resampled residuals to the original fitted values. Then, we regress these new $Y$ values onto the fixed $X$ values to obtain bootstrap regression coefficients.
For more clarity, let's break this method down into steps:
Estimate the regression coefficients $\hat{\theta}$ for the original sample, and calculate the fitted value $\hat{Y}_i$ and residual $e_i$ for each observation.
Select bootstrap samples of the residuals; we will denote these bootstrapped residuals as $\tilde{e}_i$. Then, calculate bootstrapped $\tilde{Y}_i$ values by computing $\tilde{Y}_i = \hat{Y}_i + \tilde{e}_i$, where the fitted values $\hat{Y}_i$ are obtained from the original regression.
Regress the bootstrapped $\tilde{Y}$ values on the fixed $X$ values to obtain bootstrap regression coefficients $\tilde{\theta}$.
Repeat steps two and three several times in order to obtain several bootstrap regression coefficients $\tilde{\theta}$. These can be used to compute bootstrap standard errors and confidence intervals.
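The steps above can be sketched with NumPy on synthetic data. The regressors, coefficient values, and noise level below are hypothetical stand-ins (not the Duncan estimates); the point is only to show the mechanics of resampling residuals against fixed regressors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an intercept column plus two regressors, with a
# response generated from assumed coefficients and assumed noise.
n = 45
X = np.column_stack([np.ones(n), rng.uniform(0, 100, size=(n, 2))])
theta_true = np.array([-6.0, 0.6, 0.55])
y = X @ theta_true + rng.normal(0, 10, size=n)

# Step 1: fit the original regression; compute fitted values and residuals.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ theta_hat
residuals = y - fitted

# Steps 2-4: resample the residuals with replacement, add them back to the
# fixed fitted values, and refit on the same X to get bootstrap coefficients.
boot_coefs = np.array([
    np.linalg.lstsq(X, fitted + rng.choice(residuals, size=n), rcond=None)[0]
    for _ in range(2000)
])

# Bootstrap standard error of the second regressor's coefficient.
boot_se = boot_coefs[:, 2].std()
```

From `boot_coefs`, standard errors and percentile confidence intervals follow exactly as in the pair-resampling method above.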
Now that we have the bootstrapped regression coefficients, we can construct a confidence interval using the same techniques as before. We will leave this as an exercise.
Let's reflect on this method. By randomly reattaching resampled residuals to fitted values, this procedure implicitly assumes that the errors are identically distributed. More specifically, this method assumes that the distribution of fluctuations around the regression curve is the same for all values of the input $x$. This is a disadvantage because the true errors may have nonconstant variance; this phenomenon is called heteroscedasticity.
Although this method does not make any assumptions about the shape of the error distribution, it implicitly assumes that the functional form of the model is correct. By relying on the model to create each bootstrap sample, we assume that the model structure is appropriate.
In this section, we highlight bootstrapping techniques used in a linear regression setting.
In general, bootstrapping the observations is the more commonly used method. It is often more robust than other techniques because it makes fewer underlying assumptions; for example, if an incorrect model is fitted, this method will still yield an appropriate sampling distribution of the parameter of interest.
We also highlight an alternative method, which has several disadvantages. Bootstrapping the residuals can be used when we would like to treat our observations as fixed. Note that this method should be used with caution because it makes additional assumptions about the errors and the form of the model.
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
'''
Outputs sliders that show rows and columns of df
'''
def peek(row=0, col=0):
return df.iloc[row:row + nrows, col:col + ncols]
if len(df.columns) <= ncols:
interact(peek, row=(0, len(df) - nrows, nrows), col=fixed(0))
else:
interact(peek,
row=(0, len(df) - nrows, nrows),
col=(0, len(df.columns) - ncols))
print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
# HIDDEN
times = pd.read_csv('ilec.csv')['17.5']
The bootstrap is a process we learned about in Data 8 that we can use for estimating a population statistic using only one sample. The general procedure for bootstrapping is as follows:
Sample with replacement from the original sample to create a resample of the same size.
Compute the test statistic on the resample.
Repeat the two steps above many times, recording each resample's statistic.
Here, we end up with many test statistics from individual resamples, from which we can form a distribution. In Data 8, we were taught to form a 95% confidence interval by taking the 2.5th percentile and the 97.5th percentile of the bootstrap statistics. This method of bootstrapping to create a confidence interval is called the percentile bootstrap. 95% confidence implies that if we take a new sample from the population and construct a confidence interval, the confidence interval will contain the population parameter with probability 0.95. However, it is important to note that confidence intervals created from real data can only approximate 95% confidence. The percentile bootstrap in particular has lower confidence than desired at small sample sizes.
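The percentile bootstrap can be sketched in a few lines of NumPy. The exponential sample below is a hypothetical stand-in for data drawn from some population; it is not the repair-time data used later in this section.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed sample standing in for data drawn from a population.
sample = rng.exponential(scale=8.0, size=50)

# Resample with replacement, compute the statistic (here, the mean) on each
# resample, and repeat many times to form a distribution of bootstrap statistics.
boot_means = np.array([
    rng.choice(sample, size=len(sample)).mean() for _ in range(10000)
])

# Percentile bootstrap 95% CI: the 2.5th and 97.5th percentiles.
ci = np.percentile(boot_means, [2.5, 97.5])
```

The interval `ci` is the percentile bootstrap confidence interval whose coverage properties this section examines.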
Below, we've taken a population and created one thousand bootstrap 95% confidence intervals for the population mean for different sample sizes. The y-axis represents the fraction of the one thousand confidence intervals that contained the real population mean. Notice that at sample sizes below 20, fewer than 90% of the confidence intervals actually contain the population mean.

We can measure coverage error by calculating the difference between our measured confidence here and our desired 95% confidence. We can see that the coverage error for percentile bootstrap is very high at small sample sizes. In this chapter, we will introduce a new bootstrap method, called the studentized bootstrap method, that has a lower coverage error but requires more computation.
The New York Public Utilities Commission monitors the response time for repairing land-line phone service in the state. These repair times may differ over the year and according to the type of repair. We have a census of repair times for one class of repairs at one time period for a specific Incumbent Local Exchange Carrier (a telephone company that held the regional monopoly on landline service before the market was opened to competitive local exchange carriers, or the corporate successor of such a firm). The commission is interested in estimates of the average repair time. First, let's look at the distribution of all of the times.
plt.hist(times, bins=20, density=True)
plt.xlabel('Repair Time')
plt.ylabel('Proportion per Hour')
plt.title('Distribution of Repair Times');
Let's say we want to estimate the population mean of the repair times. We first define a function that computes our statistic. By passing in the whole population, we can see that the actual average repair time is about 8.4 hours.
def stat(sample, axis=None):
return np.mean(sample, axis=axis)
theta = stat(times)
theta
Now we define a function that draws a sample of a given size from the population without replacement.
def take_sample(n=10):
return np.random.choice(times, size=n, replace=False)
In real life, we won't be able to draw many samples from the population (we use bootstrap to be able to use just one sample). But for demonstration purposes, we have access to the entire population, so we will take 1000 samples of size 10 and plot the distribution of the sample means.
samples_from_pop = 1000
pop_sampling_dist = np.array(
[stat(take_sample()) for _ in range(samples_from_pop)]
)
plt.hist(pop_sampling_dist, bins=30, density=True);
plt.xlabel('Average Repair Time')
plt.ylabel('Proportion per Hour')
plt.title(r'Distribution of Sample Means ($\hat{\theta}$)');
We can see that the mode of this distribution is around 5, and that it is skewed right because of the skewed distribution of the original data.
Now we can look at how a single bootstrap distribution can stack up against a distribution sampled from the population.
Generally, we are aiming to estimate $\theta^*$, our population parameter (in this case, the average repair time of the population, which we found to be ~8.4). Each individual sample can be used to calculate an estimated statistic, $\hat{\theta}$ (in this case, the average repair time of a single sample). The plot above shows what we call an empirical distribution of $\hat{\theta}$, calculated from many estimated statistics from many samples from the population. For the bootstrap, however, we want the statistic of a resample of the original sample, which is called $\tilde{\theta}$.
In order for the bootstrap to work, we want our original sample to look similar to the population, so that resamples also look similar to the population. If our original sample does look like the population, then the distribution of average repair times calculated from the resamples will look similar to the empirical distribution of samples directly from the population.
Let's take a look at how an individual bootstrap distribution looks. We define functions to take a sample of size 10 from the population without replacement and bootstrap it 1,000 times to get our distribution.
bootstrap_reps = 1000
def resample(sample, reps):
n = len(sample)
return np.random.choice(sample, size=reps * n).reshape((reps, n))
def bootstrap_stats(sample, reps=bootstrap_reps, stat=stat):
resamples = resample(sample, reps)
return stat(resamples, axis=1)
np.random.seed(0)
sample = take_sample()
plt.hist(bootstrap_stats(sample), bins=30, density=True)
plt.xlabel('Average Repair Time')
plt.ylabel('Proportion per Hour')
plt.title(r'Distribution of Bootstrap Sample Means ($\tilde{\theta}$)');
As you can see, our distribution of $\tilde{\theta}$ doesn't look quite like the distribution of $\hat{\theta}$, likely because our original sample did not look like the population. As a result, our confidence intervals perform rather poorly. Below is a side-by-side comparison of the two distributions:
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.xlabel('Average Repair Time')
plt.ylabel('Proportion per Hour')
plt.title(r'Distribution of Sample Means ($\hat{\theta}$)')
plt.hist(pop_sampling_dist, bins=30, range=(0, 40), density=True)
plt.ylim((0,0.2))
plt.subplot(122)
plt.xlabel('Average Repair Time')
plt.ylabel('Proportion per Hour')
plt.title(r'Distribution of Bootstrap Sample Means ($\tilde{\theta}$)')
plt.hist(bootstrap_stats(sample), bins=30, range=(0, 40), density=True)
plt.ylim((0,0.2))
plt.tight_layout();
As we saw, the main issue with the percentile bootstrap procedure is that it requires a large sample size to reach the desired 95% confidence. With the studentized bootstrap procedure, we can do a little more computation to get better coverage at smaller sample sizes.
The idea behind the studentized bootstrap procedure is to normalize the distribution of the test statistic to be centered at 0 and have a standard deviation of 1. This will correct for the spread difference and skew of the original distribution. In order to do all of this, we need to do some derivation first.
In the percentile bootstrap procedure, we generate many values of $\tilde{\theta}$, and then we take the 2.5th and 97.5th percentiles for our confidence interval. For short, we refer to these percentiles as $q_{2.5}$ and $q_{97.5}$. Note that both of these values come from the bootstrap statistics.
With this procedure, we hope that the probability that the actual population statistic lies within our confidence interval is about 95%. In other words, we hope for the following:

$$P(q_{2.5} \leq \theta^* \leq q_{97.5}) = 0.95$$
We make two approximations during this procedure: since we assume our random sample looks like the population, we approximate $\theta^*$ with $\hat{\theta}$; since we assume a random resample looks like the original sample, we approximate $\hat{\theta}$ with $\tilde{\theta}$. Since the second approximation relies on the first one, both introduce error into the confidence interval, which creates the coverage error we saw in the plot.
We aim to reduce this error by normalizing our statistic. Instead of using our calculated values of $\tilde{\theta}$ directly, we use:

$$\tilde{t} = \frac{\tilde{\theta} - \hat{\theta}}{SE(\tilde{\theta})}$$
This will normalize the resample statistic by the sample statistic, and then divide by the standard deviation of the resample statistic (this standard deviation is also called the standard error, or SE).
This whole normalized statistic is called the Student's t-statistic, so we call this bootstrap method the studentized bootstrap or the bootstrap-t method.
As usual, we compute this statistic for many resamples, and then take the 2.5th and 97.5th percentiles, $q_{2.5}$ and $q_{97.5}$. We hope that the normalized population parameter lies between these percentiles:

$$P\left( q_{2.5} \leq \frac{\hat{\theta} - \theta^*}{SE(\hat{\theta})} \leq q_{97.5} \right) \approx 0.95$$
We can now solve the inequality for $\theta^*$:

$$\hat{\theta} - q_{97.5} \, SE(\hat{\theta}) \;\leq\; \theta^* \;\leq\; \hat{\theta} - q_{2.5} \, SE(\hat{\theta})$$
This means we can construct our confidence interval using just $\hat{\theta}$ (the test statistic on the original sample), $q_{2.5}$ and $q_{97.5}$ (the percentiles of the normalized statistics computed from the resamples), and $SE(\hat{\theta})$ (the standard error of the sample test statistic). This last value is estimated using the standard deviation of the resample test statistics.
Thus, to compute a studentized bootstrap CI, we perform the following procedure:

1. Compute $\hat{\theta}$, the test statistic on the original sample.
2. Bootstrap the sample. For each resample, compute the statistic $\tilde{\theta}$ and its standard error $SE(\tilde{\theta})$, and record the normalized statistic $q = \frac{\tilde{\theta} - \hat{\theta}}{SE(\tilde{\theta})}$.
3. Estimate $SE(\hat{\theta})$ using the standard deviation of the $\tilde{\theta}$ values.
4. Take the 2.5th and 97.5th percentiles of the $q$ values, $q_{2.5}$ and $q_{97.5}$, and report the interval $\left[\, \hat{\theta} - q_{97.5}\,SE(\hat{\theta}),\ \hat{\theta} - q_{2.5}\,SE(\hat{\theta}) \,\right]$.
It is important to note that $SE(\tilde{\theta})$, the standard error of the resample test statistic, is not always easy to compute and depends on the test statistic. For the sample mean, $SE(\tilde{\theta}) = \frac{\tilde{\sigma}}{\sqrt{n}}$: the standard deviation of the resample values divided by the square root of the sample size.
Also remember that we have to use the resample values to compute $SE(\tilde{\theta})$; we use the sample values to compute $SE(\hat{\theta})$.
If our test statistic, however, does not have an analytic expression (like the one we have for the sample mean), then we need to do a second-level bootstrap. For each resample, we bootstrap it again, and compute the test statistic on each second-level resample (the resampled resample), and compute the standard deviation using these second-level statistics. Typically, we do around 50 second-level resamples.
This greatly increases computation time for the studentized bootstrap procedure. If we do 50 second-level resamples, the entire procedure will take 50 times as long as it would if we had an analytic expression for $SE(\tilde{\theta})$.
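When no analytic formula is available, the second-level bootstrap described above can be sketched as follows. The helper name `second_level_se` and its inputs are illustrative, not from the book's code:

```python
import numpy as np

def second_level_se(resample, inner_reps=50, stat=np.mean):
    """Estimate the SE of a statistic on one resample by bootstrapping
    the resample itself (a second-level bootstrap)."""
    n = len(resample)
    # Draw inner resamples from the outer resample, with replacement
    inner = np.random.choice(resample, size=(inner_reps, n), replace=True)
    # Compute the test statistic on each second-level resample
    inner_stats = stat(inner, axis=1)
    # The SD of the second-level statistics estimates SE of the resample statistic
    return np.std(inner_stats)
```

Each outer resample would call a helper like this once, which is why the full procedure costs roughly `inner_reps` times more than using a closed-form standard error.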
To assess the tradeoffs of studentized and percentile bootstrap, let's compare the coverage of the two methods using the repair times dataset.
plt.hist(times, bins=20, density=True);
plt.xlabel('Repair Time')
plt.ylabel('Proportion per Hour')
plt.title('Distribution of Repair Times');
We will take many samples from the population, compute a percentile confidence interval and a studentized confidence interval for each sample, and then compute the coverage for each. We will repeat this for varying sample sizes to see how the coverage of each method changes with sample size.
We can use np.percentile to compute the percentile confidence interval below:
def percentile_ci(sample, reps=bootstrap_reps, stat=stat):
    stats = bootstrap_stats(sample, reps, stat)
    return np.percentile(stats, [2.5, 97.5])
np.random.seed(0)
sample = take_sample(n=10)
percentile_ci(sample)
To do the studentized bootstrap, we need a lot more code:
def studentized_stats(sample, reps=bootstrap_reps, stat=stat):
    '''
    Computes studentized test statistics for the provided sample.
    Returns the studentized test statistics and the SD of the
    resample test statistics.
    '''
    # Bootstrap the sample and compute \tilde \theta values
    resamples = resample(sample, reps)
    resample_stats = stat(resamples, axis=1)
    resample_sd = np.std(resample_stats)
    # Compute SE of \tilde \theta.
    # Since we're estimating the sample mean, we can use the formula.
    # Without the formula, we would have to do a second level bootstrap here.
    resample_std_errs = np.std(resamples, axis=1) / np.sqrt(len(sample))
    # Compute studentized test statistics (q values)
    sample_stat = stat(sample)
    t_statistics = (resample_stats - sample_stat) / resample_std_errs
    return t_statistics, resample_sd
def studentized_ci(sample, reps=bootstrap_reps, stat=stat):
    '''
    Computes 95% studentized bootstrap CI
    '''
    t_statistics, resample_sd = studentized_stats(sample, reps, stat)
    lower, upper = np.percentile(t_statistics, [2.5, 97.5])
    sample_stat = stat(sample)
    return (sample_stat - resample_sd * upper,
            sample_stat - resample_sd * lower)
np.random.seed(0)
sample = take_sample(n=10)
studentized_ci(sample)
Now that everything is written out, we can compare the coverages of the two methods as the sample size increases from 4 to 100.
def coverage(cis, parameter=theta):
    return (
        np.count_nonzero([lower < parameter < upper for lower, upper in cis])
        / len(cis)
    )
def run_trials(sample_sizes):
    np.random.seed(0)
    percentile_coverages = []
    studentized_coverages = []
    for n in sample_sizes:
        samples = [take_sample(n) for _ in range(samples_from_pop)]
        percentile_cis = [percentile_ci(sample) for sample in samples]
        studentized_cis = [studentized_ci(sample) for sample in samples]
        percentile_coverages.append(coverage(percentile_cis))
        studentized_coverages.append(coverage(studentized_cis))
    return pd.DataFrame({
        'percentile': percentile_coverages,
        'studentized': studentized_coverages,
    }, index=sample_sizes)
%%time
trials = run_trials(np.arange(4, 101, 2))
trials.plot()
plt.axhline(0.95, c='red', linestyle='--', label='95% coverage')
plt.legend()
plt.xlabel('Sample Size')
plt.ylabel('Coverage')
plt.title('Coverage vs. Sample Size for Studentized and Percentile Bootstraps');
As we can see, the studentized bootstrap has a much better coverage at smaller sample sizes.
The studentized bootstrap is, for the most part, better than the percentile bootstrap, especially if we only have a small sample to start with. We generally want to use it when the sample size is small or when the original data are skewed. Its main drawback is computation time, which is further magnified when $SE(\tilde{\theta})$ is not easy to compute.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/18'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
As we discussed, a p-value or probability value is the chance, based on the model in the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative. If a p-value is small, that means the tail beyond the observed statistic is small and so the observed statistic is far away from what the null predicts. This implies that the data support the alternative hypothesis better than they support the null. By convention, when we see that the p-value is below 0.05, the result is called statistically significant, and we reject the null hypothesis.
Dangers present themselves when the p-value is misused. P-hacking is the act of misusing data analysis to show that patterns in data are statistically significant when in reality they are not. This is often done by performing many tests on the data and focusing only on the tests that return significant results.
In this section, we will go over a few examples of the dangers of p-values and p-hacking.
One of the biggest dangers of blindly relying on the p-value to determine "statistical significance" arises when we are just trying to find the "sexiest" results that give us "good" p-values. This is commonly done with "food frequency questionnaires," or FFQs, which study how eating habits correlate with other characteristics (diseases, weight, religion, etc.). FiveThirtyEight, an online blog that focuses on opinion poll analysis among other things, made their own FFQ, and we can use their data to run our own analysis and find some silly results that can be considered "statistically significant."
data = pd.read_csv('raw_anonymized_data.csv')
# Do some EDA on the data so that categorical values get changed to 1s and 0s
data.replace('Yes', 1, inplace=True)
data.replace('Innie', 1, inplace=True)
data.replace('No', 0, inplace=True)
data.replace('Outie', 0, inplace=True)
# These are some of the columns that give us characteristics of FFQ-takers
characteristics = ['cat', 'dog', 'right_hand', 'left_hand']
# These are some of the columns that give us the quantities/frequencies of different food the FFQ-takers ate
ffq = ['EGGROLLQUAN', 'SHELLFISHQUAN', 'COFFEEDRINKSFREQ']
We will look specifically at whether people own cats or dogs, and at their handedness.
data[characteristics].head()
Additionally, we will look at how much shellfish, eggrolls, and coffee people consumed.
data[ffq].head()
So now we can calculate the p-value for every pair of characteristic and food frequency/quantity features.
# HIDDEN
from scipy import stats

def findpvalue(data, c, f):
    return stats.pearsonr(data[c].tolist(), data[f].tolist())[1]
# Calculate the p value between every characteristic and food frequency/quantity pair
pvalues = {}
for c in characteristics:
    for f in ffq:
        pvalues[(c, f)] = findpvalue(data, c, f)
pvalues
Our study finds that:
| Eating/Drinking | is linked to: | P-value |
|---|---|---|
| Egg rolls | Dog ownership | <0.0001 |
| Shellfish | Right-handedness | 0.0002 |
| Shellfish | Left-handedness | 0.0004 |
| Coffee | Cat ownership | 0.0016 |
Clearly this is flawed! Aside from the fact that some of these correlations seem to make no sense, we also found that shellfish is linked to both right and left handedness! Because we blindly tested all columns against each other for statistical significance, we were able to just choose whatever pairs gave us "statistically significant" results. This shows the dangers of blindly following the p-value without a care for proper experimental design.
A/B testing is a very simple concept. We measure a statistic in a normal, controlled environment (we'll call this A), and then we compare that to the same statistic in an environment with one change. This form of testing is used frequently in marketing and ad research to compare the effectiveness of certain features of ads.
Let's say we are working for a company whose website lets users make their own custom videogames. The company has a free version, which lets users make very basic videogames, and a paid version, which gives users access to more advanced tools for making videogames. When a user has finished making a videogame via a free account, we send them to a landing page that gives them the option to sign up for a paid account. Our measured statistic in this case would be how many free users sign up for a paid account upon reaching this page. We can send half of our users one version of the page, which may have text explaining in detail the benefits of the paid account (this will be version A), and the other half of our users will get another version of the page, which may have a colorful graphic that explains some of the benefits of the paid account (this will be version B).
There is a very specific reason why it's called A/B testing, and not A/B/C/D... testing. That is because we can very easily run into problems if we try to test multiple versions at the same time.
Let's say that we have 15 different sign-up pages (one is the control, in this case "A"), each with something different about them (one has a picture of a puppy, one has a quote from a customer, one has a graphic, etc.), and let's say that in this case, none of our variations actually has an effect on user interaction (so, matching the simulation below, we model each page's sign-up rate with a Gaussian distribution with a mean of 0.1 and a standard deviation of 0.01).
# HIDDEN
n = 50
reps = 1000
num_pages = 15
np.random.seed(11)
def permute(A, B):
    combined = np.append(A, B)
    shuffled = np.random.choice(combined, size=len(combined), replace=False)
    return shuffled[:n], shuffled[n:]

def permutedpvalue(A, B):
    obs = test_stat(A, B)
    resampled = [test_stat(*permute(A, B)) for _ in range(reps)]
    # p-value: proportion of permuted statistics at least as extreme as observed
    return np.count_nonzero(resampled >= obs) / reps
n = 50
reps = 1000
num_pages = 15
# This will represent percentage of users that make a paid account from the landing page
# Note that all pages have no effect, so they all just have a base 10% of interactions.
landing_pages = [np.random.normal(0.1, 0.01, n) for _ in range(num_pages)]
# This will be our "control"
A = landing_pages[0]
# Our test statistic will be the difference between the mean percentage
def test_stat(A, B):
    return np.abs(np.mean(B) - np.mean(A))
p_vals = []
for i in range(1, num_pages):
# We test against each of the non-control landing pages
B = landing_pages[i]
p_val = permutedpvalue(A, B)
p_vals.append(p_val)
print(p_vals)
sns.distplot(p_vals, bins=8, kde=False)
plt.xlim((0,1))
plt.show()
As we can see, more than one of these pages has a p-value below 0.05, despite our knowing that there is actually no difference between the pages. This is why we do single A/B tests with multiple trials, as opposed to multiple hypothesis tests with single trials. A p-value can too easily give us a false positive if we just try enough times.
Sometimes, multiple testing can happen by accident. If many researchers are investigating the same phenomenon at the same time, then it's very possible that one of the researchers can end up with a lucky trial. That is exactly what happened during the 2010 World Cup.
Paul the Octopus was a common octopus who lived in a Sea Life Centre in Oberhausen, Germany. He is best known for correctly predicting the outcomes of all seven soccer matches Germany played during the 2010 World Cup, as well as the final match between the Netherlands and Spain.
Before a match was played, Paul's owners would place two boxes in his tank containing food, each box labeled with a different flag of the opposing countries. Whichever box Paul chose to eat from first was considered his prediction for the outcome of the match.

So why was Paul so good at predicting the outcome of these matches? Was he actually psychic, or was he just lucky? We might ask: what is the chance that he got all of his predictions correct, assuming he was just guessing?
Paul correctly predicted the outcomes of 8 of the 2010 World Cup matches; each time, he had a 1/2 chance of making the correct prediction. The chance of getting all 8 matches correct out of 8, assuming he was guessing, is:

$$\left(\frac{1}{2}\right)^8 = \frac{1}{256} \approx 0.004$$
So was he actually psychic? Or is there something more to uncover?
It turns out there were tons of animals (some of them in the same zoo as Paul!) doing the same thing, trying to guess the outcomes of their respective home countries' matches.
Some might argue that getting them all wrong would also be remarkable. So what are the chances that at least one of the 12 animals would get either all right or all wrong?
We can use simple probability to figure this out. We have 12 trials (in this case, animals), and each independent trial has a $2 \cdot \left(\frac{1}{2}\right)^8 = \frac{1}{128}$ chance of getting all predictions either right or wrong. So what is the probability of at least one such "success"? That's:

$$1 - \left(1 - \frac{1}{128}\right)^{12} \approx 0.09$$
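We can check this arithmetic directly, treating each animal as making eight 50/50 guesses (the simplified model used in the text):

```python
# Chance one animal guesses all 8 matches correctly
p_all_right = (1 / 2) ** 8          # 1/256

# Chance one animal gets all 8 either all right or all wrong
p_extreme = 2 * p_all_right         # 1/128

# Chance at least one of 12 independent animals is that extreme
p_at_least_one = 1 - (1 - p_extreme) ** 12
print(round(p_at_least_one, 2))     # 0.09
```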
We have about a 9% chance of seeing an animal that makes all the right predictions, and that's not even including all of the other animals in the world that were also making these "predictions." That's not that rare; the dangers of multiple testing caused this "phenomenon." This one octopus, out of many different animals in the world, happened to guess all of the right outcomes, and the popularity of the situation made it seem magical.
To those of you wondering if it really was luck, it has been shown that the species Octopus vulgaris is actually colorblind, and some believe that octopuses are drawn to horizontal shapes, hence Paul's decision to choose Germany, except when playing against Spain and Serbia.
In the end, we know that studies are more trustworthy when they are replicated. Data scientists should be wary of cases like Paul the Octopus's, where there was only a single instance of an animal correctly predicting a string of World Cup matches. Only if he repeated that performance across multiple tournaments should we start taking the data seriously.
As it turns out, p-hacking isn't the only thing data scientists and statisticians have to worry about when making sound inferences from data. There are many stages to the design and analysis of a successful study, as shown below (from Leek & Peng's P values are just the tip of the iceberg).

As shown, the last step of the whole "data pipeline" is the calculation of an inferential statistic like the p-value, and the application of a decision rule to it (e.g. p < 0.05). But there are many other decisions made beforehand, like experimental design or EDA, that can have much greater effects on the results: mistakes like simple rounding or measurement errors, choosing the wrong model, or not taking confounding factors into account can change everything. By changing the way data are cleaned, summarized, or modeled, we can achieve arbitrary levels of statistical significance.
A simple example of this is rolling a pair of dice and getting two 6s. If we take as our null hypothesis that the dice are fair and not weighted, and take our test statistic to be the sum of the dice, we find that the p-value of this outcome is 1/36, or about 0.028: a "statistically significant" result suggesting that the dice are weighted. But obviously, a single roll is nowhere near enough evidence to decide either way; it shows that blindly applying the p-value without properly designing a good experiment can lead to bad results.
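The dice p-value can be verified by enumerating all 36 equally likely outcomes:

```python
from itertools import product

# Null hypothesis: both dice are fair.
# Test statistic: the sum of the two dice.
rolls = list(product(range(1, 7), repeat=2))

# p-value: chance of a sum at least as extreme as 12 (only (6, 6) qualifies)
p_value = sum(1 for a, b in rolls if a + b >= 12) / len(rolls)
print(round(p_value, 3))  # 0.028
```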
In the end, what is most important is education on the subject of safe hypothesis testing, and making sure you don't fall into the follies of poor statistical decisions.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/19'))
A vector is defined by a length and a direction.

Notice that the two vectors shown have the same length and direction; they are equal vectors.
To scale a vector is to change its length.

Notice that the two vectors shown have the same direction but different lengths; they are not equal.
To add two vectors $\vec{u}$ and $\vec{v}$, take one step according to $\vec{u}$, then immediately take one step according to $\vec{v}$ (or vice versa). This is also known as the triangle method: place the initial point of one vector at the terminal point of the other.

Vectors are usually represented as Cartesian Coordinates.

In this notation, arithmetic operations we saw earlier become quite easy.

Vectors can be added and scaled element-wise:

$$\vec{u} + \vec{v} = \begin{bmatrix} u_1 + v_1 \\ \vdots \\ u_n + v_n \end{bmatrix} \qquad c\,\vec{u} = \begin{bmatrix} c\,u_1 \\ \vdots \\ c\,u_n \end{bmatrix}$$
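For instance, element-wise addition and scaling can be seen with NumPy (the vectors here are illustrative):

```python
import numpy as np

u = np.array([1, 2])
v = np.array([3, -1])

print(u + v)    # element-wise addition: [4 1]
print(2 * u)    # scaling: [2 4]
```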
In any $n$-dimensional space, $\vec{1}$ is the vector of all $1$'s:

$$\vec{1} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$$
The span of a set of vectors $\{\vec{v}_1, \dots, \vec{v}_p\}$ is the set of all possible linear combinations of those vectors:

$$\text{span}\left(\{\vec{v}_1, \dots, \vec{v}_p\}\right) = \left\{ c_1 \vec{v}_1 + \dots + c_p \vec{v}_p \;:\; c_1, \dots, c_p \in \mathbb{F} \right\}$$

where $\mathbb{F}$ is the field of the vector space (out of scope).
A vector space $V$ is the span of a set of vectors $\{\vec{v}_1, \dots, \vec{v}_p\}$, where each $\vec{v}_i$ is an $n$-dimensional column vector.
A subspace $U$ of $V$ is the span of a set of vectors $\{\vec{u}_1, \dots, \vec{u}_q\}$, where each $\vec{u}_i \in V$. This means every vector in $U$ is also in $V$.
When you put any two vectors tail to tail, without changing their directions, you can measure the angle between them.

Intuition in $\mathbb{R}^2$:
Recall the triangle method of adding two vectors. If we add two perpendicular vectors $\vec{u}$ and $\vec{v}$ in $\mathbb{R}^2$, then we know that the resulting vector $\vec{u} + \vec{v}$ will be the hypotenuse. In this case, we also know that its length will follow the Pythagorean Theorem: $\|\vec{u} + \vec{v}\| = \sqrt{\|\vec{u}\|^2 + \|\vec{v}\|^2}$.

General formula for the length of $\vec{v} \in \mathbb{R}^n$:

$$\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} = \sqrt{\vec{v} \cdot \vec{v}}$$

where the final operation is the dot product.
$$\vec{u} \cdot \vec{v} = \sum_{i=1}^{n} u_i v_i = \|\vec{u}\|\,\|\vec{v}\|\cos\theta$$

The first expression is known as the algebraic definition of the dot product, and the second is the geometric definition. Note that the dot product is the inner product defined for vectors in $\mathbb{R}^n$.
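The two definitions agree, which we can check numerically for a pair of illustrative vectors that meet at a known 45-degree angle:

```python
import numpy as np

u = np.array([3.0, 0.0])
v = np.array([2.0, 2.0])   # 45 degrees from u

# Algebraic definition: sum of element-wise products
algebraic = np.sum(u * v)                # 3*2 + 0*2 = 6

# Geometric definition: |u||v|cos(theta) with theta = 45 degrees
geometric = (np.linalg.norm(u) * np.linalg.norm(v)
             * np.cos(np.pi / 4))

print(algebraic, round(geometric, 10))   # 6.0 6.0
```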

For two non-zero vectors to be orthogonal, they must satisfy the property that $\vec{u} \cdot \vec{v} = 0$. Since they have non-zero lengths, the only way for the two vectors to be orthogonal is if $\cos\theta = 0$. One satisfying $\theta$ is 90 degrees, our familiar right angle.
To project one vector $\vec{x}$ onto another vector $\vec{y}$, we want to find the multiple $k\vec{y}$ of $\vec{y}$ that is closest to $\vec{x}$.

By the Pythagorean Theorem, we know that $k$ must be the scalar such that $\vec{x} - k\vec{y}$ is perpendicular to $\vec{y}$; then $k\vec{y}$ is the (orthogonal) projection of $\vec{x}$ onto $\vec{y}$. Solving $(\vec{x} - k\vec{y}) \cdot \vec{y} = 0$ gives $k = \frac{\vec{x} \cdot \vec{y}}{\vec{y} \cdot \vec{y}}$.
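As a small numeric check (the vectors are illustrative), we can compute the projection scalar by requiring the residual to be perpendicular to $\vec{y}$:

```python
import numpy as np

x = np.array([2.0, 3.0])
y = np.array([4.0, 0.0])

# Scalar k such that (x - k*y) is perpendicular to y
k = np.dot(x, y) / np.dot(y, y)
proj = k * y                       # projection of x onto y

print(proj)                        # [2. 0.]
print(np.dot(x - proj, y))         # 0.0 (residual is orthogonal to y)
```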
Likewise, to project one vector $\vec{x}$ onto any vector space spanned by a set of vectors $\{\vec{v}_1, \dots, \vec{v}_p\}$, we still find the linear combination $c_1\vec{v}_1 + \dots + c_p\vec{v}_p$ that is closest to $\vec{x}$.

# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/20'))
This appendix contains reference tables for the pandas, seaborn,
matplotlib, and scikit-learn methods used in the book. It is meant to
provide a helpful overview of the small subset of methods that we use most
often in this book.
For each library, we list the methods used, the chapter where each method is first mentioned, and a brief description of the method's functionality.
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/20'))
| Function | Chapter | Description |
|---|---|---|
pd.DataFrame(data) |
Tabular Data and pandas | Create a DataFrame from a two-dimensional array or dictionary data |
pd.read_csv(filepath) |
Tabular Data and pandas | Import a CSV file from filepath as a pandas DataFrame |
pd.DataFrame.head(n=5)pd.Series.head(n=5) |
Tabular Data and pandas | View the first n rows of a DataFrame or Series |
pd.DataFrame.indexpd.DataFrame.columns |
Tabular Data and pandas | View a DataFrame's index and column values |
pd.DataFrame.describe()pd.Series.describe() |
Exploratory Data Analysis | View descriptive statistics about a DataFrame or Series |
pd.Series.unique() |
Exploratory Data Analysis | View unique values in a Series |
pd.Series.value_counts() |
Exploratory Data Analysis | View the number of times each unique value appears in a Series |
df[col] |
Tabular Data and pandas | From DataFrame df, return column col as a Series |
df[[col]] |
Tabular Data and pandas | From DataFrame df, return column col as a DataFrame |
df.loc[row, col] |
Tabular Data and pandas | From DataFrame df, return rows with index name row and column name col; row can alternatively be a boolean Series |
df.iloc[row, col] |
Tabular Data and pandas | From DataFrame df, return rows with index number row and column number col; row can alternatively be a boolean Series |
pd.DataFrame.isnull()pd.Series.isnull() |
Data Cleaning | View missing values in a DataFrame or Series |
pd.DataFrame.fillna(value)pd.Series.fillna(value) |
Data Cleaning | Fill in missing values in a DataFrame or Series with value |
pd.DataFrame.dropna(axis)pd.Series.dropna() |
Data Cleaning | Drop rows or columns with missing values from a DataFrame or Series |
pd.DataFrame.drop(labels, axis) |
Data Cleaning | Drop rows or columns named labels from DataFrame along axis |
pd.DataFrame.rename() |
Data Cleaning | Rename specified rows or column in DataFrame |
pd.DataFrame.replace(to_replace, value) |
Data Cleaning | Replace to_replace values with value in DataFrame |
pd.DataFrame.reset_index(drop=False) |
Data Cleaning | Reset a DataFrame's indices; by default, retains old indices as a new column unless drop=True specified |
pd.DataFrame.sort_values(by, ascending=True) |
Tabular Data and pandas | Sort a DataFrame by specified columns by, in ascending order by default |
pd.DataFrame.groupby(by) |
Tabular Data and pandas | Return a GroupBy object that contains a DataFrame grouped by the values in the specified columns by |
GroupBy.<function> |
Tabular Data and pandas | Apply a function <function> to each group in a GroupBy object GroupBy; e.g. mean(), count() |
pd.Series.<function> |
Tabular Data and pandas | Apply a function <function> to a Series with numerical values; e.g. mean(), max(), median() |
pd.Series.str.<function> |
Tabular Data and pandas | Apply a function <function> to a Series with string values; e.g. len(), lower(), split() |
pd.Series.dt.<property> |
Tabular Data and pandas | Extract a property <property> from a Series with Datetime values; e.g. year, month, date |
pd.get_dummies(columns, drop_first=False) |
--- | Convert categorical variables columns to dummy variables; default retains all variables unless drop_first=True specified |
pd.merge(left, right, how, on) |
Exploratory Data Analysis; Databases and SQL | Merge two DataFrames left and right together on specified columns on; type of join depends on how |
pd.read_sql(sql, con) |
Databases and SQL | Read a SQL query sql on a database connection con, and return result as a pandas DataFrame |
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/20'))
| Function | Chapter | Description |
|---|---|---|
sns.lmplot(x, y, data, fit_reg=True) |
Data Visualization | Create a scatterplot of x versus y from DataFrame data, and by default overlay a least-squares regression line |
sns.distplot(a, kde=True) |
Data Visualization | Create a histogram of a, and by default overlay a kernel density estimator |
sns.barplot(x, y, hue=None, data, ci=95) |
Data Visualization | Create a barplot of x versus y from DataFrame data, optionally factoring data based on hue, and by default drawing a 95% confidence interval (which can be turned off with ci=None) |
sns.countplot(x, hue=None, data) |
Data Visualization | Create a barplot of value counts of variable x chosen from DataFrame data, optionally factored by categorical variable hue |
sns.boxplot(x=None, y, data) |
Data Visualization | Create a boxplot of y, optionally factoring by categorical variables x, from the DataFrame data |
sns.kdeplot(x, y=None) |
Data Visualization | If y=None, create a univariate density plot of x; if y is specified, create a bivariate density plot |
sns.jointplot(x, y, data) |
Data Visualization | Combine a bivariate scatterplot of x versus y from DataFrame data, with univariate density plots of each variable overlaid on the axes |
sns.violinplot(x=None, y, data) |
Data Visualization | Draws a combined boxplot and kernel density estimator of variable y, optionally factored by categorical variable x, chosen from DataFrame data |
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/20'))
| Function | Chapter | Description |
|---|---|---|
plt.scatter(x, y) |
Data Visualization | Creates a scatter plot of the variable x against the variable y |
plt.plot(x, y) |
Data Visualization | Creates a line plot of the variable x against the variable y |
plt.hist(x, bins=None) |
Data Visualization | Creates a histogram of x. Bins argument can be an integer or sequence |
plt.bar(x, height) |
Data Visualization | Creates a bar plot. x specifies x-coordinates of bars, height specifies heights of the bars |
plt.axvline(x=0) |
Data Visualization | Creates a vertical line at the x value specified |
plt.axhline(y=0) |
Data Visualization | Creates a horizontal line at the y value specified |
| Function | Chapter | Description |
|---|---|---|
%matplotlib inline |
Data Visualization | Causes output of plotting commands to be displayed inline |
plt.figure(figsize=(3, 5)) |
Data Visualization | Creates a figure with a width of 3 inches and a height of 5 inches |
plt.xlim(xmin, xmax) |
Data Visualization | Sets the x-limits of the current axes |
plt.xlabel(label) |
Data Visualization | Sets an x-axis label of the current axes |
plt.title(label) |
Data Visualization | Sets a title of the current axes |
plt.legend() |
Data Visualization | Places a legend on the axes |
fig, ax = plt.subplots() |
Data Visualization | Creates a figure and set of subplots |
plt.show() |
Data Visualization | Displays a figure |
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/20'))
| Import | Function | Section | Description |
|---|---|---|---|
sklearn.model_selection |
train_test_split(*arrays, test_size=0.2) |
Modeling and Estimation | Returns two random subsets of each array passed in, with 0.8 of the array in the first subset and 0.2 in the second subset |
sklearn.linear_model |
LinearRegression() |
Modeling and Estimation | Returns an ordinary least squares Linear Regression model |
sklearn.linear_model |
LassoCV() |
Modeling and Estimation | Returns a Lasso (L1 regularization) linear model that selects the best model by cross-validation |
sklearn.linear_model |
RidgeCV() |
Modeling and Estimation | Returns a Ridge (L2 regularization) linear model that selects the best model by cross-validation |
sklearn.linear_model |
ElasticNetCV() |
Modeling and Estimation | Returns an ElasticNet (L1 and L2 regularization) linear model that selects the best model by cross-validation |
sklearn.linear_model |
LogisticRegression() |
Modeling and Estimation | Returns a Logistic Regression classifier |
sklearn.linear_model |
LogisticRegressionCV() |
Modeling and Estimation | Returns a Logistic Regression classifier that selects the best model by cross-validation |
Assuming you have a model variable that is a scikit-learn object:
| Function | Section | Description |
|---|---|---|
model.fit(X, y) |
Modeling and Estimation | Fits the model with the X and y passed in |
model.predict(X) |
Modeling and Estimation | Returns predictions on the X passed in according to the model |
model.score(X, y) |
Modeling and Estimation | Returns the accuracy of predictions on X based on the correct values (y) |
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/21'))
We thank Joe Hellerstein, Bin Yu, and Fernando Perez for their significant efforts towards building the first iterations of Data 100.
This textbook also contains substantial contributions from past Data 100 students. We list the contributors below and thank them for their effort in creating content for the textbook.
| Name | Contributions |
|---|---|
| Ananth Agarwal | 9.1 (Relational Databases), 9.2 (SQL Queries), 15.3 (Cross-Validation) |
| Ashley Chien | 9.3 (SQL Joins), Reference Table Appendix |
| Andrew Do | 8.1 (Python String Methods), 8.2 (Regular Expressions), 8.3 (Regex in Python and pandas) |
| Sona Jeswani | 18.1 (Introduction to Hypothesis Testing), 18.2 (Permutation Testing), 18.3 (Bootstrapping for Linear Regression) |
| Tiffany Jann | 7.1 (HTTP), 11.3 (Convexity) |
| Andrew Kim | 7.1 (HTTP), 8.1 (Python String Methods), 8.2 (Regular Expressions), 8.3 (Regex in Python and pandas) |
| Jun Seo Park | 9.1 (Relational Databases), 9.2 (SQL Queries), Reference Table Appendix |
| Allen Shen | 2.2 (Probability Overview), 11.3 (Convexity), 12 (Probability and Generalization), 15 (Bias-Variance Tradeoff) |
| Katherine Yen | 9.3 (SQL Joins), 15.3 (Cross-Validation) |
| Daniel Zhu | 2.2 (Probability Overview), 8 (Working with Text), 12 (Probability and Generalization), 15 (Bias-Variance Tradeoff) |